VKS Cluster Management fails for existing projects after upgrading VCF Automation to 9.0.1

Article ID: 413831


Products

VMware Cloud Foundation

Issue/Introduction

After upgrading VMware Cloud Foundation (VCF) Automation to version 9.0.1, existing vSphere Kubernetes Service (VKS) projects and clusters may no longer be visible in the VCF Automation UI. The Kubernetes management page appears empty, so no cluster management operations can be performed for those projects.

This behavior is caused by a stale project entry referencing an organization that no longer exists, which prevents the project service from loading correctly.

Environment

VMware Cloud Foundation (VCF) 9.0.1

Cause

The root cause is a stale project record in the project_db database. This record contains an org_id that is no longer present in the tenant-manager database. When the project-service-app attempts to load project details, it fails with the error "Cannot load details from TM for org with id", which results in a 500 error and prevents any projects from being displayed in the UI.
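
The mismatch described here amounts to a set difference: org IDs referenced by projects, minus org IDs that still exist. A minimal sketch of that comparison follows, assuming the IDs from each database have already been saved to two text files with one ID per line. The function name, the file names, and the second sample ID are hypothetical; how the IDs are exported is left open.

```shell
# Print IDs that appear in the project list but not in the organization list.
# $1: file of org IDs referenced by projects, one per line (hypothetical export)
# $2: file of org IDs that still exist, one per line (hypothetical export)
find_stale_org_ids() {
  # -v inverts the match, -x matches whole lines, -F treats IDs as fixed strings
  sort -u "$1" | grep -vxFf "$2"
}

# Self-contained demonstration with sample data.
tmp=$(mktemp -d)
printf '%s\n' \
  'db6fd39f-0160-4819-9b9f-3b0a76253692' \
  '11111111-1111-1111-1111-111111111111' > "$tmp/project_org_ids.txt"
printf '%s\n' \
  '11111111-1111-1111-1111-111111111111' > "$tmp/organization_ids.txt"

find_stale_org_ids "$tmp/project_org_ids.txt" "$tmp/organization_ids.txt"
rm -rf "$tmp"
```

Any ID this prints is referenced by a project but has no matching organization, which is the condition that triggers the error.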

Resolution

This issue is resolved in VCF Automation versions 9.0.2 and 9.1. On earlier versions, including 9.0.1, use the following workaround.


Workaround

Part 1: Identify the Stale Project

First, confirm that the issue is caused by a non-existent organization ID referenced in the project service.

  1. SSH to the VCF Automation appliance and switch to the root user.
    sudo su -
  2. Export the KUBECONFIG environment variable to interact with the Kubernetes cluster.
    export KUBECONFIG=/etc/kubernetes/admin.conf
  3. Restart the resource-manager pod to trigger the service startup logic and generate the relevant error.
    • First, find the pod name:
      kubectl get pods -n prelude | grep resource-manager
    • Then, delete the pod (it will be recreated automatically):
      kubectl delete pod -n prelude <resource-manager-pod-name>
  4. Check the project-service-app log for the specific error. The presence of this error confirms the issue. Take note of the org_id from the error message.
    grep -i 'Cannot load details' /var/log/services-logs/prelude/project-service-app/file-logs/project-service-app.log
    Example Error:
    Cannot load details from TM for org with id: db6fd39f-0160-4819-9b9f-3b0a76253692
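
If the log contains many occurrences of the error, the IDs can be reduced to a unique list rather than read off by eye. The following is a minimal sketch, assuming the org ID always appears as a standard UUID on the same line as the message, as in the example above. The function name is hypothetical.

```shell
# Extract the unique org IDs named in "Cannot load details" errors.
# Reads log text on stdin, so it works on a file or on a pasted sample line.
extract_stale_org_ids() {
  grep -i 'Cannot load details from TM for org with id' \
    | grep -oE '[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}' \
    | sort -u
}

# Example using the sample error line from this article.
echo 'Cannot load details from TM for org with id: db6fd39f-0160-4819-9b9f-3b0a76253692' \
  | extract_stale_org_ids
```

On the appliance, the same function can be fed the log file from step 4 by redirecting it to stdin:
extract_stale_org_ids < /var/log/services-logs/prelude/project-service-app/file-logs/project-service-app.log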

Part 2: Remove the Stale Database Entry

Warning: Manual Database Modification

The following procedure involves direct modification of the product database. Proceed with caution.

While still logged into the appliance as root with the KUBECONFIG variable set:

  1. Exec into the PostgreSQL pod.
    kubectl exec -i -t -n prelude vcfapostgres-0 -- /bin/bash
  2. Connect to the tenant-manager database to verify which organizations currently exist.
    psql -U postgres tenantmanager
  3. List the organizations and confirm that the org_id from the log error is not in this list.
    SELECT * FROM organization;
  4. Exit the tenant-manager database by typing \q and pressing Enter.
  5. Connect to the project_db database.
    psql -U postgres project_db
  6. List all projects to identify the entry with the stale org_id.
    SELECT * FROM project;
  7. Delete the row corresponding to the project with the non-existent org_id. Be sure to replace <org_id_to_delete> with the actual ID from the log error.
    DELETE FROM project WHERE org_id = '<org_id_to_delete>';
  8. Exit the database shell (\q) and then exit the pod shell (exit).
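  9. Finally, restart the resource-manager pod one last time and verify that it starts correctly without errors. The pod was recreated in Part 1, so its name may have changed; look up the current name before deleting it.
    kubectl get pods -n prelude | grep resource-manager
    kubectl delete pod -n prelude <resource-manager-pod-name>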
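
Because the DELETE in step 7 is irreversible once committed, it can help to generate the statements for review before pasting them into psql. The sketch below is one way to do that and is not part of the documented procedure: the table and column names come from the steps above, while the function name, the UUID check, and the open transaction are additions for illustration. It only prints SQL; it does not connect to the database.

```shell
# Print a reviewable SQL script for removing one stale project row.
# Refuses anything that is not a UUID, so an unreplaced placeholder such as
# <org_id_to_delete> is rejected instead of ending up in the statement.
build_stale_project_sql() {
  if ! printf '%s\n' "$1" \
      | grep -qE '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
    echo "not a valid org ID: $1" >&2
    return 1
  fi
  cat <<EOF
BEGIN;
SELECT * FROM project WHERE org_id = '$1';
DELETE FROM project WHERE org_id = '$1';
-- Review the rows and the DELETE count shown above, then type
-- COMMIT; to keep the change or ROLLBACK; to undo it.
EOF
}

# Example using the sample org ID from the log error in Part 1.
build_stale_project_sql 'db6fd39f-0160-4819-9b9f-3b0a76253692'
```

The transaction is deliberately left open so the operator decides between COMMIT and ROLLBACK after seeing which rows were affected.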