TKGI cluster update and Bosh tile Apply Changes fail after Availability Zone removal

Article ID: 437404

Products

Operations Manager
VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  1. Resource Pools were removed from vSphere and from the BOSH Director tile, using the KB article "How to add or modify my availability zones without the need to reinstall Tanzu Application Services" for reference.
    • The corresponding Availability Zones were also removed from the BOSH Director, using the same KB for reference.
    • Apply Changes was then run on both the BOSH Director tile and the TKGI tile.
  2. Attempts to update TKGI clusters to add worker nodes with 'tkgi update-cluster <CLUSTER_NAME> --num-nodes #' now fail with errors like:

    Instance update failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: ########-####-####-####-025b71431500, broker-request-id: ########-####-####-####-fc770262cb8a, task-id: 184660776, operation: update, error-message: Network 'pks-########-####-####-####-9c0952c88ace' refers to an unknown availability zone 'AZ2'
     
  3. Additionally, Apply Changes on the BOSH Director tile in Ops Manager now fails while verifying BOSH Director health.
    • After connecting to the BOSH Director VM over SSH, 'sudo monit summary' shows that the metrics-server service is not in the Running state.
    • The metrics-server log at /var/vcap/sys/log/director/metrics-server.stderr.log reports errors like:

      `check_validity_of_subnet_availability_zone': Network 'pks-########-####-####-####-9c0952c88ace' refers to an unknown availability zone 'AZ2' (Bosh::Director::NetworkSubnetUnknownAvailabilityZone)
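
To see which AZ names the Director still expects, the error lines can be summarized straight from the log. The following is a minimal sketch, assuming the error text has the same shape as the example above; `unknown_azs` is a hypothetical helper name, and the log path is the one quoted in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list each distinct AZ name reported as unknown in a log file.
# Assumes error lines contain: refers to an unknown availability zone '<NAME>'
unknown_azs() {
  local log_file="$1"
  grep -o "unknown availability zone '[^']*'" "$log_file" \
    | sed "s/.*'\([^']*\)'/\1/" \
    | sort -u
}

# Example (run on the BOSH Director VM; reading the log may require sudo):
# unknown_azs /var/vcap/sys/log/director/metrics-server.stderr.log
```

Each name printed is an AZ that has to be temporarily restored in Step 1 of the Resolution.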

Environment

TKGI version is not relevant to this failure condition.

Cause

This issue occurs due to a synchronization mismatch between the Global BOSH Cloud Config and the Brokered Cluster Configurations managed by TKGI.

In a TKGI environment, BOSH uses a layered configuration model. When an Availability Zone (AZ) is removed from the BOSH Director tile in Ops Manager, it is deleted from the Global Cloud Config. However, TKGI manages individual Kubernetes clusters as independent BOSH deployments, each with its own Named Cloud Config and Deployment Manifest.

If an existing cluster (specifically its apply-addons errand or worker node pools) still references the deleted AZ, the BOSH Director fails validation. This creates a "deadlock" state:

  • The Director cannot complete Apply Changes from Ops Manager because it cannot validate existing deployments against a Global Cloud Config that is missing their assigned AZs.
  • The metrics-server on the Director fails to start because it cannot reconcile the stale network-to-AZ mappings stored in the BOSH database for the managed clusters.

Resolution

To resolve this, you must temporarily restore the deleted AZs to the BOSH Director to allow the stale manifests to be updated and redeployed.

 

Step 1: Restore AZ References in BOSH Cloud Config

The BOSH Director must recognize the missing AZ names to process any manifest updates.

  1. Export the current cloud config to two files (one to be edited, one backup):

    bosh -e <env> cloud-config > bosh_cc_new.yml
    bosh -e <env> cloud-config > bosh_cc_orig.yml

  2. Edit bosh_cc_new.yml to re-add entries for the missing AZs (e.g., AZ2, AZ3). Point these to valid existing vSphere resources (you can alias them to your current active AZ's resources).

    NOTE: This requires editing the azs section as well as the networks section of the cloud config. Deploying the cleaned cluster manifests in Step 2.3 reports where in the global cloud config these values need to be edited; use that output for reference if needed.

  3. Update the config:

    bosh -e <env> update-cloud-config bosh_cc_new.yml
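
Before uploading, it can be worth confirming that the edited file really defines every restored AZ. The sketch below is one way to do that with a plain text match. `azs_defined` is a hypothetical helper, not part of the bosh CLI. It assumes AZ names contain no regular-expression metacharacters. It matches any `name:` key with that value, so it does not check that the networks section was updated as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every given AZ name appears as the value
# of a "name:" key in the cloud config file. This is a text match, not a YAML
# parse, so it cannot tell an AZ entry from another entry with the same name.
azs_defined() {
  local config_file="$1"; shift
  local az missing=0
  for az in "$@"; do
    if ! grep -Eq "^[[:space:]]*(-[[:space:]]+)?name:[[:space:]]+[\"']?${az}[\"']?[[:space:]]*\$" "$config_file"; then
      echo "missing AZ entry: ${az}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: review the edit, then upload only if AZ2 and AZ3 are defined.
# diff -u bosh_cc_orig.yml bosh_cc_new.yml
# azs_defined bosh_cc_new.yml AZ2 AZ3 && bosh -e <env> update-cloud-config bosh_cc_new.yml
```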


Step 2: Manually Clean Cluster Manifests

  1. Identify all clusters referencing the old AZs by downloading their manifests:

    bosh -d service-instance_<ID> manifest > service-instance_<ID>_manifest.yml

  2. Search and replace all references of the old AZs (e.g., AZ2, AZ3) with the new/active AZs (e.g., AZ4, AZ5). Pay close attention to the apply-addons instance group.

  3. Deploy the corrected manifest:

    bosh -d service-instance_<ID> deploy service-instance_<ID>_manifest.yml
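
The search and replace in step 2 can be scripted per manifest. This is a minimal sketch, assuming GNU sed (for the `\b` word boundary) and AZ names made only of letters, digits, and underscores; `list_az_refs` and `replace_az` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line of a manifest that mentions an AZ name
# as a whole word (so AZ2 does not match AZ20), with line numbers for review.
list_az_refs() {
  local manifest="$1" az="$2"
  grep -nw -- "$az" "$manifest"
}

# Hypothetical helper: replace an old AZ name with a new one, keeping the
# untouched original next to the manifest as <manifest>.bak.
# Assumes GNU sed, which supports \b as a word boundary.
replace_az() {
  local manifest="$1" old_az="$2" new_az="$3"
  cp -- "$manifest" "${manifest}.bak"
  sed "s/\b${old_az}\b/${new_az}/g" "${manifest}.bak" > "$manifest"
}

# Example:
# list_az_refs service-instance_<ID>_manifest.yml AZ2
# replace_az   service-instance_<ID>_manifest.yml AZ2 AZ4
# diff -u service-instance_<ID>_manifest.yml.bak service-instance_<ID>_manifest.yml
```

Reviewing the diff before deploying guards against replacing an AZ name that happens to appear inside an unrelated value.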


Step 3: Synchronize TKGI Control Plane

Updating the BOSH manifest manually does not update the TKGI database. You must trigger a sync to update the Named Cloud Configs managed by the TKGI broker.

  1. For each corrected cluster, run:

    tkgi update-cluster <CLUSTER_NAME> --num-nodes <unchanged_current_count>

  2. If the cluster utilizes compute profiles, include those in the command:

    tkgi update-cluster <CLUSTER_NAME> --compute-profile <PROFILE_NAME> --node-pool-instances "<POOL_NAME>:<COUNT>"
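
When many clusters need the sync, the commands can be generated for review rather than typed one at a time. The sketch below only prints commands and runs nothing. `print_sync_commands` is a hypothetical helper, and the input list of cluster names with their current worker counts is assumed to have been collected by the operator beforehand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "<cluster_name> <current_worker_count>" pairs, one
# per line, and print the matching update command. Nothing is executed.
print_sync_commands() {
  local name count
  while read -r name count; do
    # Skip blank lines and comment lines.
    case "$name" in ''|'#'*) continue ;; esac
    printf 'tkgi update-cluster %s --num-nodes %s\n' "$name" "$count"
  done
}

# Example (clusters.txt is an operator-maintained file, one pair per line):
# print_sync_commands < clusters.txt
```

Clusters that use compute profiles still need the second command form shown in step 2.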


Step 4: Final Cleanup and Director Update

  1. Once all clusters are updated and no longer reference the old AZs, revert the Global Cloud Config to the desired state (removing the temporary AZ2/AZ3 references):

    bosh -e <env> update-cloud-config bosh_cc_orig.yml

  2. Return to the Ops Manager GUI.

  3. Run Apply Changes on the Bosh Director tile. The metrics-server should now validate successfully as all cluster-specific network references in the BOSH database have been updated to active AZs.
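
Reverting the cloud config in step 1 while any cluster still references an old AZ would recreate the original failure, so a check beforehand may help. The following is a minimal sketch, assuming the current manifests have been downloaded again into one directory as `.yml` files. `no_stale_az_refs` is a hypothetical helper, and it inspects only deployment manifests, not the Named Cloud Configs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan downloaded manifests in a directory for any of the
# given AZ names (whole-word match). Prints each file that still references
# one and returns non-zero, so it can gate the cloud config revert.
no_stale_az_refs() {
  local manifest_dir="$1"; shift
  local az stale=0
  # Refuse to report success when there is nothing to check.
  ls "$manifest_dir"/*.yml >/dev/null 2>&1 || {
    echo "no manifests found in ${manifest_dir}" >&2
    return 2
  }
  for az in "$@"; do
    if grep -lw -- "$az" "$manifest_dir"/*.yml; then
      stale=1
    fi
  done
  return "$stale"
}

# Example:
# no_stale_az_refs ./manifests AZ2 AZ3 && bosh -e <env> update-cloud-config bosh_cc_orig.yml
```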

Additional Information

The Dependency Hierarchy

In TKGI, the BOSH Director doesn't just manage one big system; it manages a fleet of independent Kubernetes clusters. Each layer must be valid for the layer above it to function.

  • Global Cloud Config (The Foundation): Defines the physical world (AZs, Networks, VM Types). If an AZ is removed here, the Director "forgets" it exists.
  • Named Cloud Configs (The Scaffold): TKGI creates a specific config for every cluster (e.g., pks-CLUSTER-ID). These map the cluster's logical requirements to the Global Cloud Config.
  • Cluster Manifest (The Blueprint): Defines the actual VMs (Master/Worker nodes) and Errands (Apply-Addons) for that specific Kubernetes instance.

 

The TKGI update-cluster Workflow

When you execute tkgi update-cluster, you are triggering a synchronized orchestration between the TKGI API (the Broker) and the BOSH Director. This process ensures that the "Blueprint" (Manifest) and the "Scaffold" (Named Cloud Config) are updated simultaneously.

  1. The TKGI API Layer (The Request)
    The TKGI API receives your command (e.g., --num-nodes 5). It queries its internal database to retrieve the Plan associated with that cluster.
    • Action: The API calculates the new resource requirements.
    • Result: It generates a new version of the Cluster Manifest and a new Named Cloud Config.

  2. Changes in the "Named Cloud Config"
    This is the most critical and often overlooked part of the TKGI workflow. Unlike standard BOSH deployments, TKGI uses Named Configs to provide isolation for each cluster's networking.
    • Location: You can see this via:

      bosh configs | grep -E "pivotal-container-service|pks"

    • The Change:
      • Subnet Mapping: If you have modified AZs or network ranges in the plan, the subnets block within this named config is rewritten.
      • AZ Binding: It explicitly maps the Kubernetes cluster’s network name to the Global AZs.
      • Example: If you move a cluster from AZ2 to AZ4, the Named Cloud Config is updated to point the pks-network to AZ4 before the VM deployment begins.

  3. Changes in the Cluster Manifest
    Once the Named Cloud Config is updated, the TKGI API pushes the updated Deployment Manifest to the BOSH Director.
    • Instance Groups: The instances: count for the worker job is updated (e.g., from 3 to 5).
    • AZ Placement: The azs: list for the instance groups (Master, Worker, and Errands) is updated to match the new plan configuration.
    • Version Increment: The manifest version is incremented, and BOSH prepares a "Diff" to show which VMs will be created, changed, or deleted.

  4. Interactions with the "Global Cloud Config"
    The Global Cloud Config (managed via the BOSH Director Tile in Ops Manager) is the only layer that remains unchanged during a tkgi update-cluster.
    • The Dependency: The Global Cloud Config acts as the "Source of Truth." During the update, the BOSH Director validates the Named Cloud Config against the Global Cloud Config.
    • The Failure Point: If your update-cluster command references AZ4, but AZ4 is not defined in the Global Cloud Config, the Director will reject the update immediately.

 

The End-to-End Execution Flow

  1. Validation: BOSH confirms the Manifest (what you want) is compatible with the Named Config (where it goes), which is in turn compatible with the Global Config (what is physically available).
  2. Compilation: If the update involves a new Kubernetes version or stemcell, BOSH compiles the necessary packages.
  3. Update Strategy: BOSH performs a "rolling update."
    • It creates new Worker VMs in the newly specified AZs.
    • It deletes old Worker VMs from the decommissioned AZs.
  4. Errand Execution: After the VMs are stable, BOSH runs the apply-addons errand. This errand uses the updated AZ mapping to spin up its temporary VM, ensuring the Kubernetes CNI and Storage Classes are aware of the new AZ placement.
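
The validation chain in step 1 can be pictured as two subset checks: every AZ in the manifest must exist in the Named Config, and every AZ in the Named Config must exist in the Global Config. The sketch below is a simplified model of that idea, not the Director's actual implementation; `missing_from` and the AZ lists are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each name from the first space-separated list that
# is absent from the second list. Empty output means the first list is covered.
missing_from() {
  local wanted="$1" available="$2" item
  for item in $wanted; do
    case " $available " in
      *" $item "*) ;;
      *) echo "$item" ;;
    esac
  done
}

# Illustrative values only, mirroring the failure described in this article.
manifest_azs="AZ1 AZ2"
named_config_azs="AZ1 AZ2"
global_azs="AZ1"

missing_from "$manifest_azs" "$named_config_azs"   # prints nothing
missing_from "$named_config_azs" "$global_azs"     # prints AZ2
```

In this model the second check fails on AZ2, which corresponds to the "unknown availability zone 'AZ2'" error seen after the AZ was removed from the Global Cloud Config.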