vCenter Backups with Supervisor backup option selected fail with: A general system error occurred: failed to run cmd /usr/lib/vmware-wcp/backup-restore/backup.py on CPVM...
Article ID: 432443


Products

VMware vSphere Kubernetes Service Tanzu Kubernetes Runtime

Issue/Introduction

vCenter Backups with Supervisor backup option selected fail with: A general system error occurred: failed to run cmd /usr/lib/vmware-wcp/backup-restore/backup.py on CPVM VirtualMachine:vm-#####. Err: /usr/lib/vmware-wcp/backup-restore/backup.py: exit 1 

This error is found in the vCenter backup log at /var/log/vmware/applmgmt/backup.log:

[ComponentScriptsBackup:PID-######] [Log::run:Log.py:64] ERROR:       reason = 'failed to run cmd /usr/lib/vmware-wcp/backup-restore/backup.py on CPVM VirtualMachine:vm-#####. Err: /usr/lib/vmware-wcp/backup-restore/backup.py: exit 1'

The VM ID of the Supervisor VM appears in the error message as vm-#####.

Review the logs for that Supervisor VM, either directly via SSH* or from a log bundle under wcp-support-bundle-domain-c####-#######-##-#-##.tar_extracted/master-vm-#####.tgz.

*To find the VM ID from the UI, click on each Supervisor VM in the vSphere UI and check the browser URL, which contains the VM ID.

On the Supervisor VM, the log /var/log/vmware/wcp/sv_backup_script.log shows the following error:

 

ERROR backup: Cmd ['/usr/local/bin/skopeo', '--insecure-policy', 'sync', '--src', 'docker', '--src-tls-verify=false', '--dest', 'dir', '--scoped', 'localhost:5000/vmware/registry-agent:0.0.10.17963681', '/var/lib/vmware/wcp/backup/tmpoxc9gwr7'] failed. ret=2, stdout=, stderr=time="YYYY-MM-DDTHH:MM:SSZ" level=info msg="Tag presence check" imagename="localhost:5000/vmware/registry-agent:0.0.10.17963681" tagged=true
time="2026-02-11T19:18:26Z" level=info msg="Copying image ref 1/1" from="docker://localhost:5000/vmware/registry-agent:0.0.10.17963681" to="dir:/var/lib/vmware/wcp/backup/tmpoxc9gwr7/localhost:5000/vmware/registry-agent:0.0.10.17963681"
time="YYYY-MM-DDTHH:MM:SSZ" level=fatal msg="Error copying ref \"docker://localhost:5000/vmware/registry-agent:0.0.10.17963681\": initializing source docker://localhost:5000/vmware/registry-agent:0.0.10.17963681: reading manifest 0.0.10.17963681 in localhost:5000/vmware/registry-agent: manifest unknown"
Traceback (most recent call last):
  File "/usr/lib/vmware-wcp/backup-restore/backup.py", line 64, in run
    result = subprocess.run(cmd, capture_output=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/bin/skopeo', '--insecure-policy', 'sync', '--src', 'docker', '--src-tls-verify=false', '--dest', 'dir', '--scoped', 'localhost:5000/vmware/registry-agent:0.0.10.17963681', '/var/lib/vmware/wcp/backup/tmpoxc9gwr7']' returned non-zero exit status 2. 

Note down the name of the manifest for the Resolution section. It varies from case to case, but in this example it is registry-agent:0.0.10.17963681
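The failing image reference can also be pulled out of the log with a quick grep; a minimal sketch, using a sample line in place of the live log file:

```shell
# Sketch: extract the image reference from a backup error line. On a live
# Supervisor VM you would grep /var/log/vmware/wcp/sv_backup_script.log;
# the sample line below stands in for real log output.
line='ERROR backup: Cmd failed for localhost:5000/vmware/registry-agent:0.0.10.17963681'
echo "$line" | grep -o "localhost:5000/vmware/[^ '\"]*"
```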

Environment

vSphere Supervisor 7.x, 8.x, 9.x 

Cause

This issue is caused by a stale job object on the Supervisor cluster. The job was created when the embedded Harbor registry (deprecated in 2023 in favor of Harbor as a Supervisor Service) was enabled, and it failed to be deleted when the embedded Harbor registry was deactivated.

Resolution

Find the job object that references the image from the error found in /var/log/vmware/wcp/sv_backup_script.log and remove the stale job object. 

1. SSH into the Supervisor VMs following https://knowledge.broadcom.com/external/article?legacyId=90194

2. Check all job objects under the vmware-system-registry namespace via

kubectl get jobs -n vmware-system-registry -o yaml | less

Then, within less, perform a forward search ( / ) for a job that references the manifest:

/registry-agent:0.0.10.17963681 

This should return one or two jobs similar to the following.

 

---
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "YYYY-MM-DDT00:00:00Z"
  labels:
    controller-uid: ########-####-####-####-############
    job-name: harbor-##########-controller-registry-##########
  name: harbor-##########-controller-registry-##########
  namespace: vmware-system-registry
  resourceVersion: "389741805"
  uid: ########-####-####-####-############
spec:
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  manualSelector: false
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  selector:
    matchLabels:
      controller-uid: ########-####-####-####-############
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: ########-####-####-####-############
        job-name: harbor-##########-controller-registry-##########
    spec:
      containers:
      - args:
        - -test.coverprofile
        - /tmp/cover.out
        command:
        - /registry-agent
        env:
        - name: CRON_JOB
          value: ROTATE_SYSTEM_ADMIN_CREDENTIAL
        - name: NAMESPACE
          value: vmware-system-registry-##########
        - name: REGISTRY_DOMAIN
          value: ###.###.###.### 
        - name: REGISTRY_NAME
          value: harbor-########
        image: localhost:5000/vmware/registry-agent:0.0.10.17963681
        imagePullPolicy: IfNotPresent
        name: harbor-1459597518-controller-registry-credential-rotate-job
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/master: ""
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoSchedule
        key: kubeadmNode
        operator: Equal
        value: master
status:
  completionTime: "YYYY-MM-DDT00:00:00Z"
  conditions:
  - lastProbeTime: "YYYY-MM-DDT00:00:00Z"
    lastTransitionTime: "YYYY-MM-DDT00:00:00Z"
    status: "True"
    type: Complete
  startTime: "YYYY-MM-DDT00:00:00Z"
  succeeded: 1
---

 

Notice that the image line matches the error message.

        image: localhost:5000/vmware/registry-agent:0.0.10.17963681
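As an alternative to paging through less, job names and their images can be listed side by side so the stale job stands out; a minimal sketch, assuming a standard kubectl jsonpath query (the expression is not from this article, so verify it against your kubectl version):

```shell
# Sketch: print each job's name and container image in the namespace.
# The job whose image matches the error message is the stale one.
kubectl get jobs -n vmware-system-registry \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
```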

Using the name/namespace line, delete the job or jobs found. 

  name: harbor-##########-controller-registry-##########
  namespace: vmware-system-registry

The delete command is:

kubectl delete job -n vmware-system-registry harbor-##########-controller-registry-##########
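After deleting, it can be worth confirming that nothing in the namespace still references the stale image before re-running the backup; a minimal sketch (empty output means the reference is gone):

```shell
# Verify no remaining job in the namespace references the stale image;
# substitute the manifest name from your own error message.
kubectl get jobs -n vmware-system-registry -o yaml | grep 'registry-agent:0.0.10.17963681'
```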

 

After this, re-run the backup. It may fail on a different image, in which case repeat the process of finding and removing stale jobs.

Additional Information

If any stale jobs outside of the vmware-system-registry namespace are found to be blocking the backup, please open a case with Broadcom technical support to investigate further. Do not delete them.