Unable to upgrade a VKS cluster because system packageInstalls (PKGI) managed by clusterbootstrap failed to deploy.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl describe clusterbootstrap -n <VKS cluster namespace> <VKS cluster name>
While connected to the affected VKS cluster context, the following symptoms are observed:
kubectl get pkgi -A
kubectl describe pkgi -n <pkgi namespace> <pkgi>
usefulErrorMessage: |
vendir: Error: Syncing directory '0':
Syncing directory '.' with imgpkgBundle contents:
Fetching image:
GET http://localhost:5000/v2/tkg/packages/core/<PKGI>/manifests/<VERSION>:
MANIFEST_UNKNOWN: manifest unknown
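On a cluster with many packages, the `kubectl get pkgi -A` output can be narrowed to the failing entries before describing each one. The following is a minimal sketch, not part of the documented procedure. The helper name `failed_pkgi` is hypothetical, and it assumes the status text of a failing PKGI contains "ReconcileFailed" or "Reconcile failed".

```shell
# Hypothetical helper: print only the PKGI rows that report a failed reconcile.
# Assumption: the status/description column contains "ReconcileFailed" or
# "Reconcile failed" (matched case-insensitively).
failed_pkgi() {
  kubectl get pkgi -A | grep -Ei 'reconcile ?failed'
}
```

Run `failed_pkgi` while connected to the affected VKS cluster context, then pass each listed namespace and name to `kubectl describe pkgi` to read its usefulErrorMessage.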
kubectl logs <docker-registry-pod> -n kube-system
level=error msg="response completed with error" err.code="manifest unknown" err.detail="unknown tag=<version>" err.message="manifest unknown" ... http.request.host="localhost:5000" ... http.request.uri="/v2/tkg/packages/core/<pkgi>/manifests/<version>" ... vars.name="tkg/packages/core/metrics-server" vars.reference="<version>"
vSphere Supervisor
VKS Cluster
The error may be accurate: one or more system PKGI managed by clusterbootstrap failed to deploy because a manifest is missing from the docker-registry that is expected to serve it on the control plane nodes of the affected VKS cluster. Each VKR version has its own local docker-registry, which holds manifests only for the system PKGI versions specific to that VKR version. During a VKS cluster upgrade, system PKGI show a ReconcileFailed state with MANIFEST_UNKNOWN because the system repeatedly checks for manifests that become available only once the new docker-registry pods on the desired VKR version are healthy.
However, if the docker-registry for the desired VKR version is running and healthy in the affected VKS cluster and the same manifest unknown errors are still reported, there is an issue with the clusterbootstrap components for the affected VKS cluster.
This MANIFEST_UNKNOWN error is expected if the docker-registry for the desired VKR version is not yet running on one of the control plane nodes in the upgrading VKS cluster.
kubectl get cluster,kcp,md,ma -n <VKS cluster namespace>
kubectl get nodes
kubectl get pods -n kube-system -o wide | grep docker
kubectl describe pod -n kube-system <docker registry pod> | grep -i image
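The last two checks above can be combined so that every docker-registry pod is inspected in one pass. This is a minimal sketch built only from the commands already listed. The function name `registry_images` is hypothetical, and matching pods on the string "docker" mirrors the `grep docker` filter above.

```shell
# Hypothetical helper: print the image lines for each docker-registry pod.
# Assumption: registry pods in kube-system have "docker" in their name,
# as implied by the "grep docker" filter in the steps above.
registry_images() {
  kubectl get pods -n kube-system -o wide | awk '/docker/ {print $1}' |
  while read -r pod; do
    printf '%s\n' "== $pod"
    # </dev/null keeps the inner command from consuming the pod list on stdin.
    kubectl describe pod -n kube-system "$pod" </dev/null | grep -i image
  done
}
```

Run `registry_images` while connected to the affected VKS cluster context and compare the reported images against the desired VKR version.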
If the docker-registry pods and their associated control plane nodes are running and healthy on the desired VKR version but the logs still report manifest unknown errors, reach out to VMware by Broadcom Technical Support and reference this KB article.