VKS clusterbootstrap system PKGI failing with MANIFEST_UNKNOWN

Article ID: 437612

Updated On:

Products

VMware vSphere Kubernetes Service

Tanzu Kubernetes Runtime

Issue/Introduction

A VKS cluster upgrade cannot complete because one or more system packageInstalls (PKGI) managed by clusterbootstrap fail to deploy.

 

While connected to the Supervisor cluster context, the following symptoms are observed:

  • Describing the clusterbootstrap shows one or more system PKGI are in ReconcileFailed state:
    kubectl describe clusterbootstrap -n <VKS cluster namespace> <VKS cluster name>

 

While connected to the affected VKS cluster context, the following symptoms are observed:

  • One or more system PKGI are stuck in ReconcileFailed state.
    kubectl get pkgi -A

     

  • Describing the ReconcileFailed PKGI shows an error similar to the following:
    kubectl describe pkgi -n <pkgi namespace> <pkgi>
    
    usefulErrorMessage: |
          vendir: Error: Syncing directory '0':
            Syncing directory '.' with imgpkgBundle contents:
              Fetching image:
                GET http://localhost:5000/v2/tkg/packages/core/<PKGI>/manifests/<VERSION>:
                  MANIFEST_UNKNOWN: manifest unknown

     

  • The logs of one or more docker-registry pods show a manifest unknown error for each failed PKGI. Values in angle brackets <> vary by environment and by the affected system PKGI (metrics-server in this example):
    kubectl logs <docker-registry-pod> -n kube-system
    
    level=error msg="response completed with error" err.code="manifest unknown" err.detail="unknown tag=<version>" err.message="manifest unknown" ... http.request.host="localhost:5000" ... http.request.uri="/v2/tkg/packages/core/<pkgi>/manifests/<version>" ... vars.name="tkg/packages/core/metrics-server" vars.reference="<version>"
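
The symptom checks above can be scripted. The following is a minimal sketch, not part of the original procedure: two hypothetical helper functions, failed_pkgi and missing_manifests, that filter the text output of the commands shown above. It assumes the default table layout of kubectl get pkgi -A (namespace and name in the first two columns) and the log line format shown in this article; adjust the patterns if your output differs.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of VKS or kubectl).

# Reads `kubectl get pkgi -A` output on stdin and prints "<namespace>/<name>"
# for every row whose text mentions a failed reconcile.
# Assumption: namespace and name are the first two columns.
failed_pkgi() {
  awk 'tolower($0) ~ /reconcile ?failed/ { print $1 "/" $2 }'
}

# Reads docker-registry pod logs on stdin and prints each distinct request URI
# that was answered with "manifest unknown".
# Assumption: log lines follow the format shown in this article.
missing_manifests() {
  sed -n 's/.*err\.code="manifest unknown".*http\.request\.uri="\([^"]*\)".*/\1/p' | sort -u
}
```

Typical use would be to pipe the existing commands into the helpers:
    kubectl get pkgi -A | failed_pkgi
    kubectl logs <docker-registry-pod> -n kube-system | missing_manifests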

Environment

vSphere Supervisor

VKS Cluster

Cause

The MANIFEST_UNKNOWN error means that a manifest required by a system PKGI is missing from the docker-registry running on the control plane nodes of the affected VKS cluster. Each VKR version has its own local docker-registry, which holds manifests only for the system PKGI versions specific to that VKR version. During a VKS cluster upgrade, the system repeatedly requests the manifests for the desired VKR version, so system PKGI show ReconcileFailed with MANIFEST_UNKNOWN until the new docker-registry pods on the desired VKR version are healthy. In that window, the system may be correctly reporting the failure.

However, if the docker-registry for the desired VKR version is running and healthy in the affected VKS cluster and the same manifest unknown errors persist, there is an issue with the clusterbootstrap components for the affected VKS cluster.
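
One way to see which versions a given docker-registry actually serves is to list its tags. The sketch below is an illustration under several assumptions, not a documented VKS procedure: it assumes shell access to a control plane node, that curl is available there, and that the registry exposes the standard registry HTTP API v2 tags/list endpoint at the localhost:5000 address seen in the error message. The function name registry_tags_url is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: builds the tag-listing URL for one system package.
# $1 = package name as it appears in the failing manifest path
# $2 = registry base URL (defaults to the address from the error message)
registry_tags_url() {
  printf '%s/v2/tkg/packages/core/%s/tags/list\n' "${2:-http://localhost:5000}" "$1"
}
```

On a control plane node, the URL could then be queried, for example:
    curl -s "$(registry_tags_url metrics-server)"
If the version named in the failed PKGI is absent from the returned tags, that would be consistent with the registry on that node belonging to a different VKR version.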

Resolution

This MANIFEST_UNKNOWN error is expected if the docker-registry for the desired VKR version is not yet running on one of the control plane nodes in the upgrading VKS cluster.

  1. Connect to the Supervisor cluster context.

  2. Confirm whether an upgrade is in progress and check the status of all control plane nodes from the Supervisor cluster context:
    kubectl get cluster,kcp,md,ma -n <VKS cluster namespace>

     

  3. Connect to the affected VKS cluster context.

  4. Check the status of all control plane nodes from the VKS cluster context:
    kubectl get nodes

     

  5. Confirm the health of all docker-registry pods within the VKS cluster:
    kubectl get pods -n kube-system -o wide | grep docker

     

  6. Note the image version of each docker-registry pod running on the control plane nodes:
    kubectl describe pod -n kube-system <docker-registry-pod> | grep -i image

     

  7. Manifest unknown errors are expected if the only healthy docker-registry pods are running on old control plane nodes that are still on the old VKR version.
    Prioritize troubleshooting the docker-registry pods for the new VKR version and any control plane node on the new VKR version.
    The docker-registry pod is expected to be one of the first pods to start on any newly created node.
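
The health check in step 5 can be condensed into a quick filter. This is a minimal sketch under stated assumptions: the hypothetical function unhealthy_registry_pods reads the default kubectl get pods table, assumes docker-registry pod names contain "docker-registry", and treats a pod as unhealthy when its STATUS is not Running or its READY count is incomplete.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not part of VKS or kubectl).
# Reads `kubectl get pods` table output on stdin and prints NAME, READY and
# STATUS for each docker-registry pod that is not fully ready and Running.
# Assumption: NAME, READY and STATUS are the first three columns.
unhealthy_registry_pods() {
  awk '$1 ~ /docker-registry/ {
    split($2, ready, "/")   # READY column looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}
```

For example:
    kubectl get pods -n kube-system -o wide | unhealthy_registry_pods
Empty output suggests every docker-registry pod is healthy. The image versions from step 6 still need to be compared against the desired VKR version.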

 

If the docker-registry pods and their associated control plane nodes are healthy on the desired VKR version but the logs still report manifest unknown errors, contact VMware by Broadcom Technical Support and reference this KB article.