Unable to provision new nodes for VMSP cluster scaling and lifecycle operations

Article ID: 434922

Updated On:

Products

VCF Operations, VMware Cloud Foundation

Issue/Introduction

Day 2 action failures:

  • A Day 2 operation that creates or replaces a node (e.g. scale-up, add node, resize nodes, change machine type) may fail or remain in a "running" or "pending" state.
  • In the VCF UI, LCM, or the API task/response for the triggered operation, you may see the following errors:

The VM template could not be found
The template was not found at the expected path
Cloning the virtual machine failed
Unable to clone from template
Source virtual machine or template not found

vCenter Events:

  • Recent Tasks: A failed "Clone virtual machine" (or similar) task with a reason such as source template not found, virtual machine not found, or invalid state may be seen when the platform is trying to create a new node.
  • Events: On the cluster, resource pool, or the folder where the VMSP cluster VMs and templates live, you may see events about failed clone operations or a missing/invalid source VM or template.
  • VMs and Templates view: In the folder where the VMSP cluster was deployed, the template vcf-services-runtime-template-<version>.<ob-number> may be missing or corrupted; a corrupted template may show (orphaned) or (inaccessible).

Environment

  • VCF Operations 9.1
  • VMware Cloud Foundation 9.1

Cause

The VCF Management Service Platform (VMSP) uses the vcf-services-runtime-template-<version>.<ob-number> VM template in vCenter to provision both control plane and worker nodes.

If the VM template is deleted, moved, or corrupted, any provisioning task that requires creating a new node, such as a scale-up operation, fails until the template is restored.

Impact:

  • No impact on existing nodes: already-running control plane and worker nodes continue to operate.
  • New node creation fails: Cluster API (CAPI) cannot clone from the missing or invalid template.
  • Day-N operations such as scale-up, node rollout (e.g. disk size change, machine type change), and replacing failed nodes will fail.

Resolution

  • Re-populate the VMSP VM template in vCenter via the Staging API flow.

This triggers a fresh synchronization of the component from the repository to the vCenter environment.

Prerequisites

  • Access: Network connectivity to the VMSP Platform Gateway (Management API).
  • Repository: The depot manifest URL must be accessible from the VMSP cluster.
  • Authentication: The same administrative credentials used during cluster bring up in VCF Installer / VCF Ops.

Procedure

Step 1: Remove the existing template in vCenter (if present)

The Staging API does not overwrite an existing or corrupted template. If a template with the same name already exists in vCenter, delete or move it before calling the Staging API.

  • In the vSphere Client, open the 'VMs and Templates' view.
  • Go to the folder where the VMSP cluster was deployed (the same folder configured as the VM/template folder during cluster bring up in VCF Installer / VCF Ops).
  • Find the VM template named 'vcf-services-runtime-template-<version>.<ob-number>'.
  • If the template exists, delete it or move it to a different folder.

Step 2: Obtain a Management API Token

Obtain an authentication token using the Platform FQDN:

curl --request POST \
--url https://<platform-fqdn>/api/v1/identity/token \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data grant_type=password \
--data username=<username> \
--data password=<psswd>

Note:

  • Use the same FQDN and credentials that were configured during cluster bring up in VCF Installer / VCF Ops:
  • Platform Gateway FQDN: This is the same value configured as "Platform gateway FQDN" (or "Platform FQDN") when you brought up the cluster in VCF Installer or VCF Ops. It is the hostname used to reach the VMSP Management API over HTTPS (port 443).
  • Username: [email protected] (default admin account).
  • Password: The administrative (system) password you set during cluster bring up in VCF Installer / VCF Ops.
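
To reuse the token in the later steps, it can be captured into a shell variable. The following is a minimal sketch, not a confirmed recipe: the helper name extract_token is hypothetical, and the access_token response field is an assumption based on common OAuth2-style token responses. Inspect the actual response and adjust the jq filter if the field is named differently.

```shell
# Hypothetical helper: read the token endpoint's JSON response on stdin and
# print the token.
# ASSUMPTION: the token is in a top-level "access_token" field.
extract_token() {
  jq -r '.access_token // empty'
}

# Usage (needs network access to the platform; placeholders as in Step 2).
# Add -k only if the platform certificate is not trusted by your workstation.
# TOKEN=$(curl -s --request POST \
#   --url "https://<platform-fqdn>/api/v1/identity/token" \
#   --header 'Content-Type: application/x-www-form-urlencoded' \
#   --data grant_type=password \
#   --data-urlencode "username=<username>" \
#   --data-urlencode "password=<psswd>" | extract_token)
# [ -n "$TOKEN" ] || echo "No token found in the response" >&2
```

Using --data-urlencode for the username and password keeps special characters in the password from breaking the form-encoded request body.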

Step 3: Invoke the Staging API

1. Obtain the depot manifest URL

The depot manifest is a YAML file that defines the VMSP platform package (including the VM template). It must be hosted on an accessible HTTP(S) repository (URL typically provided via the customer's release process) so the cluster can retrieve configuration data during its lifecycle operations.

The depot manifest URL typically follows a standardized naming convention:

<base-url>/<path>/depot-manifest-vmsp-platform-<version>.<ob-number>.yaml

  • base-url: Typically the 'Fleet gateway FQDN' or 'Fleet Depot Service' endpoint, the same value configured as "Fleet gateway FQDN" or "Fleet Depot Service Endpoint" (or equivalent) in VCF Installer / VCF Ops.
  • path: The path to the depot manifest (e.g. /depot-service/content-gateway/PROD/COMP/vmsp-platform/).
  • version: The specific release version of the platform (e.g. 9.1.0.0).
  • ob-number: The official build number associated with the release (e.g. 25234230).

Example:

https://<fleet-fqdn>/depot-service/content-gateway/PROD/COMP/vmsp-platform/depot-manifest-vmsp-platform-9.1.0.0.25234230.yaml
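
The naming convention above can be assembled with a small shell function, which helps avoid doubled or missing slashes. This is only a sketch: the function name build_manifest_url and the host fleet.example.com are hypothetical, and the convention itself is described in this article only as typical, so confirm the real URL against your release process.

```shell
# Hypothetical helper: assemble the depot manifest URL from its parts.
# Arguments: base URL, path, version, ob-number.
build_manifest_url() {
  local base="${1%/}"   # drop one trailing slash from the base URL
  local path="${2#/}"   # drop one leading slash from the path
  path="${path%/}"      # drop one trailing slash from the path
  printf '%s/%s/depot-manifest-vmsp-platform-%s.%s.yaml\n' \
    "$base" "$path" "$3" "$4"
}

# Example with the sample values from this article (the host is a placeholder).
build_manifest_url "https://fleet.example.com" \
  "/depot-service/content-gateway/PROD/COMP/vmsp-platform/" "9.1.0.0" "25234230"
```

The resulting URL can be checked from a workstation with curl -I, but note that a successful check there does not prove that the VMSP cluster itself can reach the repository.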

2. Invoke the Staging API

  • Trigger the staging process by calling the Staging API and pointing to the platform depot manifest (use the token obtained in Step 2).

curl -ks -X POST "https://<platform-fqdn>/api/v1/components?action=stage" \
--header "Authorization: Bearer <token>" \
--header "Content-Type: application/json" \
-d '{"repository":{"url":"<depot-manifest-url>"}}'

Step 4: Monitor Task Progress

The Stage API will return a Task ID. Monitor the task until the status reaches SUCCEEDED.

curl -ks --request GET \
  --url "https://<platform-fqdn>/api/v1/tasks/<task-id>" \
  --header "Authorization: Bearer <token>"
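
Instead of re-running the request by hand, the check can be wrapped in a polling loop. The following is a sketch under assumptions: the function names fetch_task and wait_for_task and the variables PLATFORM_FQDN, TASK_ID, and TOKEN are hypothetical, and the loop assumes the task JSON has a top-level status field whose terminal values are SUCCEEDED and FAILED.

```shell
# Hypothetical helper: print the task JSON (same request as above).
fetch_task() {
  curl -ks --request GET \
    --url "https://${PLATFORM_FQDN}/api/v1/tasks/${TASK_ID}" \
    --header "Authorization: Bearer ${TOKEN}"
}

# Hypothetical helper: poll until the task reaches a terminal state.
# Arguments: seconds between polls (default 30), maximum polls (default 120).
# ASSUMPTION: top-level "status" field; SUCCEEDED and FAILED are terminal.
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if the poll limit is reached.
wait_for_task() {
  local interval="${1:-30}" max_attempts="${2:-120}" attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(fetch_task | jq -r '.status // empty')
    case "$status" in
      SUCCEEDED) echo "SUCCEEDED"; return 0 ;;
      FAILED)    echo "FAILED";    return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "TIMEOUT"
  return 2
}
```

After exporting PLATFORM_FQDN, TASK_ID, and TOKEN, calling wait_for_task 30 120 polls every 30 seconds for up to an hour. Any other status value, or an unreadable response, is treated as "still running".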

Task status returns "FAILED"

When the task status returns "FAILED", follow these steps to identify the cause and resolve the issue.

1. Check for Precheck failures (validation errors)

  • Locate the Error: If precheckGroups is non-empty, drill down into the prechecks array to find the entry where "status": "FAILED".
  • Identify the Cause: Review issue.message.default (or localized) for a high-level description (e.g., URL not found).
    • Check issue.message.args for specific variables (e.g., the exact URL or HTTP error code).
  • Find the Fix: Use resolution.default (or localized) for the specific recommended action.
  • Examples of precheck failures:
    • URL not reachable (connection/timeout or HTTP status other than 200/3xx/404).
    • URL not found (HTTP 404).

2. Check for Workflow Stage Failures

  • If the prechecks passed but the task ultimately fails, the issue is likely occurring during the execution phase. To isolate the cause, inspect the stages array:
    • Find the failed step: Identify the stage where "status": "FAILED".
    • Contextualize: Use the name and description fields to determine exactly where the process halted (e.g., "VCF Component Stage Initialization").
  • Examples of workflow stage failures:
    • Failed to upload VM template to vCenter
    • Bundle did not transition into Pushed/Successful state in time
    • Component staging operation failed
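
The precheck and stage inspection described above can be done with jq filters on a saved task response. This is a sketch, not the documented response schema: the function names are hypothetical, and it assumes that precheckGroups and stages are top-level keys of the task JSON and that resolution sits next to issue in each precheck entry. Adjust the paths to match the response you actually receive.

```shell
# Hypothetical helper: list failed prechecks from a task JSON on stdin.
# ASSUMPTION: layout is precheckGroups[].prechecks[] with "issue" and
# "resolution" as sibling keys of "status".
failed_prechecks() {
  jq -c '.precheckGroups[]?.prechecks[]?
         | select(.status == "FAILED")
         | {message: .issue.message.default,
            args: .issue.message.args,
            resolution: .resolution.default}'
}

# Hypothetical helper: list failed workflow stages from a task JSON on stdin.
# ASSUMPTION: "stages" is a top-level array with name/description/status.
failed_stages() {
  jq -c '.stages[]? | select(.status == "FAILED") | {name, description}'
}

# Usage: save the Step 4 response to task.json, then run
#   failed_prechecks < task.json
#   failed_stages < task.json
```

Both functions print nothing when no entry has failed or when the key is absent, so empty output from failed_prechecks suggests moving on to the stages check.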

3. Resolution

  • Remediate: Apply the fix suggested in the resolution field (for precheck failures) or address the cause (for workflow failures). Common reasons the Stage API or template restoration can fail:
    • Template already exists in vCenter: The API does not overwrite an existing template. Delete or move the existing template in vCenter (see Step 1).
    • Incorrect or unreachable depot manifest URL: The repository.url is wrong, returns 404, or is not reachable from the cluster. Verify the URL and how it is constructed (see Step 3).
    • vCenter permissions or connectivity: The cluster cannot upload the OVA to vCenter (permissions, network, or invalid datastore/folder). Verify vCenter credentials and the template folder/datastore configured for the cluster.
    • Timeout: The staging or template upload took too long. Retry with a longer timeout, e.g. add "options": {"timeout": "2h"} to the Stage API request body.
    • Invalid or corrupted depot manifest/package: The manifest or package at the URL is not valid for VMSP platform. Use the correct manifest from your release process.
  • Retry: Re-invoke the Stage API using the same (or corrected) payload.
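
For the retry, the request body with a longer timeout can be generated rather than typed by hand. The sketch below uses a hypothetical function name, stage_payload, and assumes that "options" is a top-level key next to "repository", which is how the timeout remediation above reads but is not otherwise confirmed here.

```shell
# Hypothetical helper: build the Stage API request body.
# Arguments: depot manifest URL, optional timeout (e.g. 2h).
# ASSUMPTION: "options" is a top-level key next to "repository".
# Note: values are inserted as-is, so they must not contain quotes or backslashes.
stage_payload() {
  local url="$1" timeout="${2:-}"
  if [ -n "$timeout" ]; then
    printf '{"repository":{"url":"%s"},"options":{"timeout":"%s"}}\n' "$url" "$timeout"
  else
    printf '{"repository":{"url":"%s"}}\n' "$url"
  fi
}

# Usage (placeholders as in Step 3):
# curl -ks -X POST "https://<platform-fqdn>/api/v1/components?action=stage" \
#   --header "Authorization: Bearer <token>" \
#   --header "Content-Type: application/json" \
#   -d "$(stage_payload "<depot-manifest-url>" 2h)"
```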

If the failure persists, contact Broadcom Support and provide the following data:

  • The task ID.
  • The failed precheckGroups or stages objects.
  • The specific error message and resolution text.

Step 5: Verify Template Restoration

Once the task is complete, perform a final validation:

  • Log in to the vSphere Client.
  • Navigate to the VMs and Templates inventory view.
  • Locate the VM template (e.g., vcf-services-runtime-template-<version>.<ob-number>).
  • Confirm the template is present, registered, and does not show a status of (orphaned) or (inaccessible).

Step 6: Set the template MoID in the PD using the Configure API (if needed)

After the template has been restored in vCenter, it typically has a new managed object ID (MoID), which may need to be updated in the PackageDeployment (PD).

1. Get the component ID

curl --silent --request GET \
--url https://<platform-fqdn>/api/v1/components \
--header "Authorization: Bearer <token>" \
| jq -r '.components[] | select(.name == "vsp" or .name == "vmsp-platform") | .id'

2. Get current configuration

curl --silent --request GET \
--url https://<platform-fqdn>/api/v1/components/<component-id> \
--header "Authorization: Bearer <token>" \
| jq '.spec.configuration.infrastructure.vsphere.templateId'

Example output: "VirtualMachine:vm-55"

3. Check if the template MoID is in sync

  • Compare the value returned (e.g. 'VirtualMachine:vm-<id>') with the MoID of the restored template in vCenter.
  • If they match, the template MoID is already in sync and no update is needed.
  • If they differ, or the configuration has no template MoID and you have a restored template, proceed with the update steps below.
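
The comparison can also be scripted. The sketch below uses a hypothetical function name, moid_in_sync, and accepts either the prefixed form returned by the API (VirtualMachine:vm-55) or the bare MoID (vm-55) for both arguments. An empty or null configured value is treated as out of sync.

```shell
# Hypothetical helper: succeed only if the configured templateId and the
# restored template's MoID refer to the same object.
# Arguments: configured templateId, MoID of the restored template.
moid_in_sync() {
  local configured="${1#VirtualMachine:}" restored="${2#VirtualMachine:}"
  [ -n "$configured" ] && [ "$configured" != "null" ] && [ "$configured" = "$restored" ]
}

# Example with the sample value from this article:
if moid_in_sync "VirtualMachine:vm-55" "vm-55"; then
  echo "in sync"        # printed for this example
else
  echo "update needed"
fi
```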

4. If needed, apply the template MoID with the Configure API

curl --request POST \
  --url 'https://<platform-fqdn>/api/v1/components/<component-id>?action=apply' \
  --header "Authorization: Bearer <token>" \
  --header 'Content-Type: application/json' \
  --data '{
    "spec": {
        "configuration": {
            "infrastructure": {
                "vsphere": {
                    "templateId": "VirtualMachine:<moid>"
                }
            }
        }
    }
}'

  • Monitor the task until it completes (See: 'Step 4: Monitor Task Progress')
  • Get the updated configuration (See: 'Get current configuration'), and verify it now references the updated MoID