Post upgrade, Supervisor stuck in "Configuring" because it cannot configure the Core Supervisor Services.



Article ID: 373329


Products

VMware vSphere with Tanzu

Issue/Introduction

1. The Supervisor status shows "Configuring" in the vSphere UI under Workload Management.

2. All pods and services inside the Supervisor are up and running.

3. Running the command "kubectl get ns" shows that the namespaces vmware-system-capw, vmware-system-pkgs, vmware-system-tkg and vmware-system-ucs are stuck in the "Terminating" state.

4. No finalizers are set on any API resources in these namespaces, and no resource is stuck in deletion.

5. No HTTP or HTTPS proxy is configured for the Supervisor or for the vCenter Server itself.

6. Describing a namespace stuck in "Terminating" reports that its webhooks cannot be called, as shown below:

NamespaceDeletionContentFailure              True      ContentDeletionFailed   Failed to delete all resource types, 4 remaining: Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "utkg.clusterclass.validating.clusterclass.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/utkg-clusterclass-validate-cluster-x-k8s-io-v1beta1-clusterclass?timeout=10s": service "vmware-system-tkg-webhook-service" not found.

 

7. The wcp logs show that the service cannot find the Supervisor Service package in packageInstalls, which stalls the upgrade and leaves the Supervisor stuck in "Configuring":

error wcp [controller/core_service_controller.go:585] [opID=CoreServiceController] error registering core services: error creating spec for registering core service 'sample-pkg.test.carvel.dev': open /etc/vmware/wcp/supervisorservices/packages/sample-pkg.test.carvel.dev-1.0.0.yaml: no such file or directory
info wcp [controller/core_service_controller.go:141] [opID=CoreServiceController] Reconciling core services on all Supervisors
debug wcp [kubelifecycle/kube_instance_grouped_conditions.go:351] [opID=CoreServiceController] No Core Services found to set condition.
debug wcp [controller/image_registry_controller.go:80] [opID=ContainerImageRegistryController] synchronizing Container Image Registries to all Supervisors
debug wcp [logger/trace.go:77] [opID=ContainerImageRegistryController] [BEGIN] [supervisor/controller.(*ContainerImageRegistryController).syncImageRegistriesToSupervisors:130] synchronizing Container Image Registries onto Supervisor clusters
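
The two command-line symptoms above (namespaces stuck in "Terminating" and the missing package spec reported by wcp) can be checked with a couple of small filters. This is a minimal sketch, not an official diagnostic: the function names are hypothetical, and the log path in the usage comment is an assumption about where wcpsvc writes its log on the appliance.

```shell
#!/bin/sh
# Hypothetical helpers; both read text on stdin so they can be tried on saved output.

# Print the names of namespaces whose STATUS column is "Terminating".
# Expects the default table output of `kubectl get ns` (header line first).
terminating_namespaces() {
  awk 'NR > 1 && $2 == "Terminating" { print $1 }'
}

# Print each distinct package spec path that wcp reports as missing
# in "error registering core services" log lines.
missing_core_service_specs() {
  grep 'error registering core services' |
    sed -n 's/.*: open \(.*\): no such file or directory.*/\1/p' |
    sort -u
}

# Example usage:
#   kubectl get ns | terminating_namespaces
#   missing_core_service_specs < /var/log/vmware/wcp/wcpsvc.log   # assumed log path
```

If the second filter prints a path containing sample-pkg.test.carvel.dev, the log matches the error described in this article.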

Environment

vSphere with Tanzu 8.0

VMware vCenter Server 8.0.x

Cause

During Reduced Downtime Upgrade (RDU) upgrades of vCenter, the configuration files on the VCSA of the source vCenter are copied onto the new target vCenter, overwriting the correct values in core-services.json and the Supervisor Services allow-list.txt. As a result, wcpsvc repeatedly searches for "sample-pkg.test.carvel.dev", an unreleased ID that was present as a placeholder in VC releases prior to the Core Services feature being enabled.
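
One way to confirm this cause is to look for the placeholder ID in the Supervisor Services configuration on the affected vCenter. The sketch below assumes the stale files still mention the ID literally; the function name is hypothetical, and the paths in the usage comment are the ones named in the Resolution section.

```shell
#!/bin/sh
# Placeholder ID named in the wcp error message.
PLACEHOLDER_ID='sample-pkg.test.carvel.dev'

# List files under the given paths that mention the placeholder ID.
# Paths that do not exist are skipped; nothing is printed when there is no match.
find_placeholder_refs() {
  for path in "$@"; do
    [ -e "$path" ] || continue
    grep -rlF -- "$PLACEHOLDER_ID" "$path" 2>/dev/null
  done
  return 0
}

# Example usage on the vCenter Server appliance:
#   find_placeholder_refs /etc/vmware/wcp/supervisorservices \
#       /etc/vmware/wcp/supervisor-services-allow-list.txt
```

Any file listed is a candidate for having been carried over from the source vCenter.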

Resolution

There is currently no permanent fix. In the meantime, the following workaround is available.

1. Overwrite all the files and folders inside the /etc/vmware/wcp/supervisorservices folder on the existing vCenter Server with the files and folders from the target VC environment.

2. Overwrite the file /etc/vmware/wcp/supervisor-services-allow-list.txt on the existing vCenter Server with the one from the target VC environment.

After completing both steps, restart the wcp service on the affected vCenter Server.
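
Because the workaround overwrites configuration in place, it may be worth keeping a copy of the current files first. The following is a minimal sketch under stated assumptions: backup_wcp_config is a hypothetical helper, /root is an arbitrary backup location, and the restart command in the usage comment assumes the standard service-control utility on the appliance.

```shell
#!/bin/sh
# Copy the current Supervisor Services config into a timestamped backup directory.
# Args: $1 = services directory, $2 = allow-list file, $3 = backup root.
# Prints the backup directory path on success.
backup_wcp_config() {
  services_dir=$1
  allow_list=$2
  backup_root=$3
  backup_dir="$backup_root/wcp-config-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$backup_dir" || return 1
  cp -a "$services_dir" "$backup_dir/" || return 1
  cp -a "$allow_list" "$backup_dir/" || return 1
  printf '%s\n' "$backup_dir"
}

# Example usage on the vCenter Server appliance:
#   backup_wcp_config /etc/vmware/wcp/supervisorservices \
#       /etc/vmware/wcp/supervisor-services-allow-list.txt /root
#   (overwrite both paths with the copies from the target VC environment)
#   service-control --restart wcp
```

The helper only reads the existing files; the overwrite and the restart are left as manual steps so they can be reviewed before running.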
