Post upgrade, Supervisor stuck in "Configuring" because it cannot configure the Core Supervisor Services.

Article ID: 373329

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  1. The Supervisor cluster status shows "Configuring" in the vSphere UI under Workload Management.
  2. All pods and services inside the Supervisor are up and running.
  3. The command "kubectl get ns" shows the namespaces vmware-system-capw, vmware-system-pkgs, vmware-system-tkg and vmware-system-ucs stuck in the "Terminating" state.
    (Note: The namespaces will not be in a deleting state if it is a fresh deployment.)
  4. No finalizers are set on any API resources in these namespaces, and no resource is stuck in "deleting".
  5. No HTTP or HTTPS proxy is configured for the Supervisor or for vCenter Server itself.
  6. Describing a namespace stuck in "Terminating" reports that webhooks cannot be called, as shown below:
    NamespaceDeletionContentFailure              True      ContentDeletionFailed   Failed to delete all resource types, 4 remaining: Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "capi.validating.tanzukubernetescluster.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/capi-validate?timeout=10s": service "vmware-system-tkg-webhook-service" not found, Internal error occurred: failed calling webhook "utkg.clusterclass.validating.clusterclass.run.tanzu.vmware.com": failed to call webhook: Post "https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/utkg-clusterclass-validate-cluster-x-k8s-io-v1beta1-clusterclass?timeout=10s": service "vmware-system-tkg-webhook-service" not found.
  7. The wcp log shows that the service cannot find the Supervisor Service package in packageInstalls, which stalls the upgrade and leaves the Supervisor stuck in "Configuring":
    error wcp [controller/core_service_controller.go:585] [opID=CoreServiceController] error registering core services: error creating spec for registering core service 'sample-pkg.test.carvel.dev': open /etc/vmware/wcp/supervisorservices/packages/sample-pkg.test.carvel.dev-1.0.0.yaml: no such file or directory
    info wcp [controller/core_service_controller.go:141] [opID=CoreServiceController] Reconciling core services on all Supervisors
    debug wcp [kubelifecycle/kube_instance_grouped_conditions.go:351] [opID=CoreServiceController] No Core Services found to set condition.
    debug wcp [controller/image_registry_controller.go:80] [opID=ContainerImageRegistryController] synchronizing Container Image Registries to all Supervisors
    debug wcp [logger/trace.go:77] [opID=ContainerImageRegistryController] [BEGIN] [supervisor/controller.(*ContainerImageRegistryController).syncImageRegistriesToSupervisors:130] synchronizing Container Image Registries onto Supervisor clusters
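
Symptom 3 can be checked quickly by filtering the namespace listing. The following is a minimal sketch, not an official tool: the helper name list_terminating_namespaces is hypothetical, and it assumes the default "kubectl get ns" column layout (NAME, STATUS, AGE) with a header row.

```shell
# Hypothetical helper: print the names of namespaces whose STATUS column
# reads "Terminating". Reads `kubectl get ns` output on stdin and skips
# the header row (assumes the default NAME / STATUS / AGE layout).
list_terminating_namespaces() {
  awk 'NR > 1 && $2 == "Terminating" { print $1 }'
}
```

Usage on the Supervisor control plane would look like "kubectl get ns | list_terminating_namespaces". Each name it prints can then be inspected with "kubectl describe ns <name>" to look for the webhook errors shown in symptom 6.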

Environment

vSphere with Tanzu 8.0
VMware vCenter Server 8.0.x

Cause

During a Reduced Downtime Upgrade (RDU) of vCenter to version 8.0U3 or 8.0U3a, the configuration files on the VCSA of the source vCenter are copied onto the new target vCenter, overwriting the correct contents of core-services.json and the Supervisor Services allow list (supervisor-services-allow-list.txt). As a result, wcpsvc repeatedly searches for "sample-pkg.test.carvel.dev", an unreleased ID that was present as a placeholder in vCenter releases before the Core Services feature was enabled.
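
One way to confirm this cause is to look for the placeholder ID in a file on the target vCenter. The sketch below is illustrative only: the helper name has_placeholder_service is hypothetical, and the idea that the overwritten core-services.json (or the wcp log) contains the literal placeholder ID is an assumption drawn from the description above.

```shell
# Hypothetical helper: succeed (exit 0) if the given file mentions the
# unreleased placeholder service ID, fail (exit 1) otherwise.
# -F treats the ID as a fixed string so the dots are not regex wildcards.
has_placeholder_service() {
  grep -qF 'sample-pkg.test.carvel.dev' "$1"
}
```

For example, "has_placeholder_service /etc/vmware/wcp/supervisorservices/core-services.json && echo stale" would print "stale" only if that file references the placeholder.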

Resolution

This issue is fixed in 8.0U3b.

The following workaround is also available. Do NOT run this workaround unless you are on 8.0U3 (24022515) or 8.0U3a (24091160), because the file contents below are specific to those two versions only.

    1. Replace all text in /etc/vmware/wcp/supervisorservices/core-services.json with
      {
        "services": {
          "tkg.vsphere.vmware.com": {
            "versions": [
              {
                "content_type": "CARVEL_APPS_YAML",
                "content_file": "/etc/vmware/wcp/supervisorservices/packages/tkg-package.yaml",
                "yaml_service_config_file": ""
              }
            ],
            "install_by_default": true,
            "install_on_nonpodvm_supervisor": true
          },
          "velero.vsphere.vmware.com": {
            "versions": [
              {
                "content_type": "CARVEL_APPS_YAML",
                "content_file": "/etc/vmware/wcp/supervisorservices/packages/velero-package.yaml",
                "yaml_service_config_file": ""
              }
            ],
            "install_by_default": true,
            "migrate_from": "velero-vsphere",
            "install_on_nonpodvm_supervisor": false
          }
        }
      }
    2. Replace all text in /etc/vmware/wcp/supervisor-services-allow-list.txt with
      # List of SupervisorService IDs allowed to be created
      # if "allow_all_services" in the service config is turned off.
      # The following IDs correspond to the PSP services already shipped.

      # Minio
      minio

      # Cloudian
      hyperstore

      # Velero Services (vDPP and newer version)
      velero-vsphere
      velero.vsphere.vmware.com

      # ECS Objectscale
      objectscale

      # Sample service ID
      sample

      # Argo CD
      argo-cd

      # CA Cluster Issuer
      ca-clusterissuer.vsphere.vmware.com

      # Harbor from the TKG packages
      # See https://gitlab.eng.vmware.com/core-build/tkg-packages/-/blob/main/standard/harbor/2.5.3/upstream-package.yaml
      harbor.tanzu.vmware.com

      # Contour from the TKG packages
      # See https://gitlab.eng.vmware.com/core-build/tkg-packages/-/blob/main/standard/contour/1.18.2/upstream-package.yaml
      contour.tanzu.vmware.com

      # External DNS from the TKG packages
      # See https://gitlab.eng.vmware.com/core-build/tkg-packages/-/blob/main/standard/external-dns/upstream-metadata.yaml
      external-dns.tanzu.vmware.com

      # Wildcard pattern for allowing Flings. The following line permits services named as "service1.fling.vsphere.vmware.com" or "my-service.fling.vsphere.vmware.com".
      *.fling.vsphere.vmware.com

      # TKG Supervisor Service
      tkg.vsphere.vmware.com

      # NSX Management Proxy Supervisor Service
      nsx-management-proxy.nsx.vmware.com

      # CCI NS Supervisor Service
      cci-ns.vmware.com
    3. Restart the wcp service on the affected vCenter server.
      vmon-cli -r wcp
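
Because steps 1 and 2 overwrite the files completely, it may be worth keeping a copy of each file before editing it. This is a minimal sketch and not part of the documented workaround: the helper name backup_file and the ".bak.<timestamp>" suffix are hypothetical choices.

```shell
# Hypothetical helper: copy a file to "<file>.bak.<timestamp>" next to the
# original, preserving mode and timestamps (-p), and print the backup path.
backup_file() {
  backup_src=$1
  backup_dest="$backup_src.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$backup_src" "$backup_dest" && printf '%s\n' "$backup_dest"
}
```

It would be run once per file before the edit, for example "backup_file /etc/vmware/wcp/supervisorservices/core-services.json" and "backup_file /etc/vmware/wcp/supervisor-services-allow-list.txt".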

Additional Information