Tanzu Hub not fully functioning after VMs restart or power-down/power-on

Products

VMware Tanzu Platform - Hub

Issue/Introduction

Following a Virtual Machine (VM) restart, hard reset, or power-off/power-on cycle, Tanzu Hub components fail to function correctly. The following specific failures are observed:

Antrea-Agent CrashLoopBackOff: The antrea-agent pods on Kubernetes worker nodes fail to start. Logs show the following fatal error: F1217 10:52:40.900641 1 main.go:54] Error running agent: error initializing agent: open /proc/sys/net/ipv4/conf/antrea-gw0/arp_announce: read-only file system
Registry Malfunction: The Registry service starts but fails to function. Investigation reveals that the storage root directory is empty or missing expected mount points.

Environment

Tanzu Hub 10.0 to 10.3

Cause

The issue is caused by a dependency on BOSH lifecycle scripts that are not triggered during a standard VM-level reboot or power cycle.

Antrea-Agent: By design, the antrea-agent pod is not privileged and cannot modify host-level kernel parameters. It relies on a BOSH pre-start script to initialize the /proc/sys/net/ipv4/conf/antrea-gw0/arp_announce parameter on the host.
Registry: The Registry component relies on a BOSH pre-start script to mount the overlay directories and storage paths.

When a VM is restarted at the OS/vSphere level, BOSH does not re-run these pre-start initialization scripts, leaving the host in an unconfigured state that the Kubernetes pods cannot self-correct.

Resolution

The product team is currently working on a permanent fix to ensure these configurations persist across reboots. Until then, use the following manual recovery steps:

Restore registry job

:~$ bosh -d hub-#### ssh registry -c "sudo /var/vcap/jobs/registry/bin/pre-start"
# wait about 10 seconds
:~$ bosh -d hub-#### ssh registry -c "sudo monit restart registry"
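The two registry commands above, including the pause between them, can be wrapped in a small helper. This is only a sketch: the function name restore_registry and the BOSH_CMD and WAIT_SECONDS variables are hypothetical conveniences, not part of the product, and the deployment name must be replaced with your own hub-#### value.

```shell
#!/usr/bin/env bash
# Sketch: rerun the registry pre-start script, pause, then restart the job.
# restore_registry, BOSH_CMD, and WAIT_SECONDS are illustrative names only.
restore_registry() {
  local deployment="$1"                      # your hub-#### deployment name
  local bosh_cmd="${BOSH_CMD:-bosh}"         # override only for dry runs
  local wait_seconds="${WAIT_SECONDS:-10}"   # the article suggests about 10 seconds

  # Remount the overlay directories and storage paths.
  "$bosh_cmd" -d "$deployment" ssh registry -c "sudo /var/vcap/jobs/registry/bin/pre-start" || return 1

  # Give the mounts a moment to settle before restarting the job.
  sleep "$wait_seconds"

  "$bosh_cmd" -d "$deployment" ssh registry -c "sudo monit restart registry"
}
```

Call it as restore_registry hub-#### with the real deployment name. The helper stops before the restart if the pre-start script fails, so a failed mount is not hidden by a restart of the job.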

Restore antrea-agent pods

First, check the pod status and confirm the error message.

:~$ bosh -d hub-#### ssh system
system/####:~$ /var/vcap/packages/kubernetes/bin/kubectl --kubeconfig /var/vcap/jobs/kube-controller-manager/config/admin-kubeconfig -n kube-system get pods -owide
...
antrea-agent-lldjm                                                2/2     Running            1 (22d ago)    22d   ##.##.##.165   ##.##.##.165   <none>           <none>
antrea-agent-mwrr4                                                1/2     CrashLoopBackOff   44 (15s ago)   22d   ##.##.##.162   ##.##.##.162   <none>           <none>
antrea-agent-pb98f                                                2/2     Running            1 (22d ago)    22d   ##.##.##.163   ##.##.##.163   <none>           <none>
...

system/####:~$ /var/vcap/packages/kubernetes/bin/kubectl --kubeconfig /var/vcap/jobs/kube-controller-manager/config/admin-kubeconfig -n kube-system logs antrea-agent-mwrr4
...
E1217 12:05:35.833430       1 sysctl_linux.go:64] "Error when setting sysctl parameter" err="open /proc/sys/net/ipv4/conf/antrea-gw0/arp_announce: read-only file system" path="ipv4/conf/antrea-gw0/arp_announce" value=1
F1217 12:05:35.834116       1 main.go:54] Error running agent: error initializing agent: open /proc/sys/net/ipv4/conf/antrea-gw0/arp_announce: read-only file system
...

Locate the node that hosts each crashing antrea-agent pod and rerun the prepare-antrea-nodes pre-start script on it.

:~$ bosh -d hub-#### is | grep ##.##.##.162
control/####               running    az3    ##.##.##.162    hub-####
:~$ bosh -d hub-#### ssh control/#### -c "sudo /var/vcap/jobs/prepare-antrea-nodes/bin/pre-start"

control/####: stdout | [Wed Dec 17 12:25:05 PM UTC 2025] Installing systemd-networkd configuration files for Antrea interfaces if needed
control/####: stdout | [Wed Dec 17 12:25:05 PM UTC 2025] Systemd version: 249
control/####: stdout | [Wed Dec 17 12:25:05 PM UTC 2025] Installing files
control/####: stdout | [Wed Dec 17 12:25:05 PM UTC 2025] Restarting systemd-networkd
control/####: stdout | [Wed Dec 17 12:25:06 PM UTC 2025] Setting arp_announce to 1 for antrea-gw0 interface
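When several nodes are affected, picking the crashing pods out of the pod listing by eye is error-prone. The filter below is a minimal sketch that reads the output of the kubectl get pods -owide command shown earlier on stdin and prints each crashing antrea-agent pod with its node. The function name crashing_antrea_nodes is hypothetical, and the column handling assumes the default wide layout seen in the sample output above.

```shell
#!/usr/bin/env bash
# Sketch: print "<pod> <node>" for antrea-agent pods in CrashLoopBackOff.
# Reads `kubectl ... get pods -owide` output on stdin.
# crashing_antrea_nodes is an illustrative name, not a product command.
crashing_antrea_nodes() {
  awk '
    $1 ~ /^antrea-agent-/ && $3 == "CrashLoopBackOff" {
      # RESTARTS is either "44" or "44 (15s ago)"; the second form
      # adds two whitespace-separated fields before AGE.
      shift_by = ($5 ~ /^\(/) ? 2 : 0
      # Fields after RESTARTS are AGE, IP, NODE, so NODE is field 7 (+ shift).
      print $1, $(7 + shift_by)
    }'
}
```

On the system VM, pipe the pod listing into the function, then look up each printed node address with bosh -d hub-#### is as shown above to find the instance that needs the pre-start script.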

The back-off restart interval for the antrea-agent pod is 5 minutes. Wait at least 5 minutes, then check whether the previously crashing antrea-agent pods are running properly.
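While waiting, the host-side setting can be checked directly on the affected node. The sketch below assumes that the value 1 reported by the pre-start log line ("Setting arp_announce to 1 for antrea-gw0 interface") is the expected final state; the function name check_arp_announce and its optional path argument are hypothetical and exist only to make the check easy to try out.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if arp_announce for antrea-gw0 reads 1 on this host.
# check_arp_announce and its optional path argument are illustrative only.
check_arp_announce() {
  local path="${1:-/proc/sys/net/ipv4/conf/antrea-gw0/arp_announce}"
  local value

  # Return 2 when the file is missing or unreadable (for example, the
  # antrea-gw0 interface does not exist yet).
  value="$(cat "$path" 2>/dev/null)" || { echo "cannot read $path" >&2; return 2; }

  if [ "$value" = "1" ]; then
    echo "arp_announce=1"
  else
    echo "arp_announce=$value (expected 1)" >&2
    return 1
  fi
}
```

Run it in a shell on the node after connecting with bosh -d hub-#### ssh. A non-zero exit status suggests the pre-start script has not taken effect on that node and should be rerun.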