Post upgrade, Supervisor stuck in "Configuring" because it cannot configure the Core Supervisor Services.

Article ID: 373329

Products

VMware vSphere with Tanzu

Issue/Introduction

1. The Supervisor status shows "Configuring" in the vSphere UI under Workload Management.

2. All pods and services inside the Supervisor are up and running.

3. Running the command "kubectl get ns" shows the following namespaces stuck in the "Terminating" state: vmware-system-capw, vmware-system-pkgs, vmware-system-tkg and vmware-system-ucs.
(Note: They will not be in a deleting state if it is a fresh deployment.)

4. No finalizers are set on any API resources in these namespaces, and no resource is stuck in "deleting".
(Note: They will not be in a deleting state if it is a fresh deployment.)

5. No HTTP or HTTPS proxy is configured for the supervisor or the vCenter server itself.

6. Describing a namespace stuck in "Terminating" shows that it cannot call the webhooks, as below:
(Note: They will not be in a deleting state if it is a fresh deployment.)

NamespaceDeletionContentFailure              True      ContentDeletionFailed   Failed to delete all resource types, 4 remaining: Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "utkg.clusterclass.validating.clusterclass.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/utkg-clusterclass-validate-cluster-x-k8s-io-v1beta1-clusterclass?timeout=10s": service "vmware-system-tkg-webhook-service" not found.

 

7. According to the wcp service log, wcp cannot see the Supervisor Service package in packageInstalls, which stalls the upgrade and leaves the Supervisor stuck in "Configuring":

error wcp [controller/core_service_controller.go:585] [opID=CoreServiceController] error registering core services: error creating spec for registering core service 'sample-pkg.test.carvel.dev': open /etc/vmware/wcp/supervisorservices/packages/sample-pkg.test.carvel.dev-1.0.0.yaml: no such file or directory
info wcp [controller/core_service_controller.go:141] [opID=CoreServiceController] Reconciling core services on all Supervisors
debug wcp [kubelifecycle/kube_instance_grouped_conditions.go:351] [opID=CoreServiceController] No Core Services found to set condition.
debug wcp [controller/image_registry_controller.go:80] [opID=ContainerImageRegistryController] synchronizing Container Image Registries to all Supervisors
debug wcp [logger/trace.go:77] [opID=ContainerImageRegistryController] [BEGIN] [supervisor/controller.(*ContainerImageRegistryController).syncImageRegistriesToSupervisors:130] synchronizing Container Image Registries onto Supervisor clusters
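
The symptoms in items 3 and 7 can be checked from a shell. The following is a minimal sketch, not an official procedure. The helper names are hypothetical, kubectl is assumed to be already logged in to the Supervisor, and the wcp log path shown in the comments is an assumption that this article does not state.

```shell
# Hypothetical helper: reads `kubectl get ns` output on stdin and prints
# the name of every namespace whose STATUS column is "Terminating".
terminating_namespaces() {
  awk 'NR > 1 && $2 == "Terminating" { print $1 }'
}

# Hypothetical helper: reads wcp log lines on stdin and prints how many
# of them mention the placeholder package ID seen in the log excerpt above.
count_placeholder_errors() {
  grep -cF 'sample-pkg.test.carvel.dev' || true
}

# Example usage (not run here):
#   kubectl get ns | terminating_namespaces
#   kubectl describe ns vmware-system-tkg
#   count_placeholder_errors < /var/log/vmware/wcp/wcpsvc.log   # log path is an assumption
```

A non-empty namespace list together with a non-zero count is consistent with the issue described here, but neither check is conclusive on its own.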

Environment

vSphere with Tanzu 8.0
VMware vCenter server 8.0.x

Cause

During RDU (Reduced Downtime Upgrade) upgrades of vCenter to version 8.0U3 or 8.0U3a, the configuration files on the VCSA are copied from the source vCenter onto the new target vCenter, overwriting the correct values in core-services.json and the Supervisor Services allow list (supervisor-services-allow-list.txt). As a result, wcpsvc repeatedly searches for "sample-pkg.test.carvel.dev", an unreleased ID that was present as a placeholder in VC releases prior to the Core Services feature being enabled.

Resolution

This issue is fixed in 8.0U3b.

The following workaround is also available. If you are not on 8.0U3 (24022515) or 8.0U3a (24091160), do NOT run this workaround, as the files below are specific to those two versions only.
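
Because the workaround is only valid for builds 24022515 and 24091160, a guard like the following could be run first. This is a minimal sketch with hypothetical helper names. It assumes the build is reported in a string of the form "build-<number>", as `vpxd -v` is expected to print on the appliance; that output format is an assumption, not something this article states.

```shell
# Hypothetical helper: reads a version string on stdin and prints the
# digits that follow "build-" (prints nothing if the pattern is absent).
build_number() {
  sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeeds only for the two builds named above.
workaround_applies() {
  case "$1" in
    24022515 | 24091160) return 0 ;;  # 8.0U3, 8.0U3a
    *) return 1 ;;
  esac
}

# Example usage (not run here; `vpxd -v` output format is an assumption):
#   build=$(vpxd -v | build_number)
#   workaround_applies "$build" && echo "workaround applies" || echo "do NOT run the workaround"
```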

1. Replace all text in /etc/vmware/wcp/supervisorservices/core-services.json with:

{
  "services": {
    "tkg.vsphere.vmware.com": {
      "versions": [
        {
          "content_type": "CARVEL_APPS_YAML",
          "content_file": "/etc/vmware/wcp/supervisorservices/packages/tkg-package.yaml",
          "yaml_service_config_file": ""
        }
      ],
      "install_by_default": true,
      "install_on_nonpodvm_supervisor": true
    },
    "velero.vsphere.vmware.com": {
      "versions": [
        {
          "content_type": "CARVEL_APPS_YAML",
          "content_file": "/etc/vmware/wcp/supervisorservices/packages/velero-package.yaml",
          "yaml_service_config_file": ""
        }
      ],
      "install_by_default": true,
      "migrate_from": "velero-vsphere",
      "install_on_nonpodvm_supervisor": false
    }
  }
}

 

2. Replace all text in /etc/vmware/wcp/supervisor-services-allow-list.txt with:

# List of SupervisorService IDs allowed to be created
# if "allow_all_services" in the service config is turned off.
# The following IDs correspond to the PSP services already shipped.

# Minio
minio

# Cloudian
hyperstore

# Velero Services (vDPP and newer version)
velero-vsphere
velero.vsphere.vmware.com

# ECS Objectscale
objectscale

# Sample service ID
sample

# Argo CD
argo-cd

# CA Cluster Issuer
ca-clusterissuer.vsphere.vmware.com

# Harbor from the TKG packages
# See https://gitlab.eng.vmware.com/core-build/tkg-packages/-/blob/main/standard/harbor/2.5.3/upstream-package.yaml
harbor.tanzu.vmware.com

# Contour from the TKG packages
# See https://gitlab.eng.vmware.com/core-build/tkg-packages/-/blob/main/standard/contour/1.18.2/upstream-package.yaml
contour.tanzu.vmware.com

# External DNS from the TKG packages
# See https://gitlab.eng.vmware.com/core-build/tkg-packages/-/blob/main/standard/external-dns/upstream-metadata.yaml
external-dns.tanzu.vmware.com

# Wildcard pattern for allowing Flings. The following line permits services named as "service1.fling.vsphere.vmware.com" or "my-service.fling.vsphere.vmware.com".
*.fling.vsphere.vmware.com

# TKG Supervisor Service
tkg.vsphere.vmware.com

# NSX Management Proxy Supervisor Service
nsx-management-proxy.nsx.vmware.com

# CCI NS Supervisor Service
cci-ns.vmware.com

 

3. Restart the wcp service on the affected vCenter server.

vmon-cli -r wcp
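
The steps above can be wrapped with a backup beforehand and a few sanity checks afterwards. This is a minimal sketch under stated assumptions. All helper names are hypothetical. The core-services check assumes the stale file mentions the placeholder ID by name. `vmon-cli -s wcp` is assumed to print a "RunState: <value>" line.

```shell
# Hypothetical helper: copy a file to <file>.bak.<timestamp> before replacing its text.
backup_file() {
  cp -p "$1" "$1.bak.$(date +%Y%m%d%H%M%S)"
}

# Hypothetical helper: succeeds if the core-services.json at $1 names both
# services from step 1 and does not mention the placeholder ID.
core_services_look_fixed() {
  grep -qF '"tkg.vsphere.vmware.com"' "$1" &&
    grep -qF '"velero.vsphere.vmware.com"' "$1" &&
    ! grep -qF 'sample-pkg.test.carvel.dev' "$1"
}

# Hypothetical helper: succeeds if the allow list at $1 has a line that is exactly the ID $2.
allow_list_has() {
  grep -qxF -e "$2" "$1"
}

# Hypothetical helper: reads service status output on stdin and prints the RunState value.
run_state() {
  awk -F': *' '$1 == "RunState" { print $2 }'
}

# Example session on the vCenter appliance (not run here):
#   backup_file /etc/vmware/wcp/supervisorservices/core-services.json
#   backup_file /etc/vmware/wcp/supervisor-services-allow-list.txt
#   ... perform steps 1 and 2 ...
#   core_services_look_fixed /etc/vmware/wcp/supervisorservices/core-services.json && echo ok
#   allow_list_has /etc/vmware/wcp/supervisor-services-allow-list.txt tkg.vsphere.vmware.com && echo ok
#   vmon-cli -r wcp
#   vmon-cli -s wcp | run_state   # output format is an assumption
```

These checks only confirm that the files contain the expected IDs. They do not validate the JSON syntax or confirm that the Supervisor has left the "Configuring" state.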
