Kubelet fails to clean up the volume of an orphaned pod, and the following error is written to kubelet.stderr.log every 2 seconds. <POD_GUID> here refers to a previous pod of the antrea-agent DaemonSet on the worker node.
E0623 15:31:23.413146 1513010 kubelet_volumes.go:263] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"<POD_GUID>\" found, but failed to rmdir() subpath at path /var/vcap/data/kubelet/pods/<POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1: remove /var/vcap/data/kubelet/pods/<POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1: device or resource busy" numErrs=4
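The message can be followed on the worker node, for example with the command below (the path follows the usual BOSH job log layout and may differ in your environment):
tail -f /var/vcap/sys/log/kubelet/kubelet.stderr.log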
`mount` output on the worker node shows subpath mount points for two antrea-agent pods:
/dev/sdb1 on /var/vcap/data/kubelet/pods/<OLD_POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)
/dev/sdb1 on /var/vcap/data/kubelet/pods/<NEW_POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)
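To show only these subpath mounts on the worker node, the output can be filtered by the volume name used in the antrea-agent manifest excerpt below:
mount | grep host-var-log-antrea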
The antrea-agent pod has a hostPath mapping as shown below; it maps the directory /var/log/openvswitch in the antrea-ovs container to the directory /var/log/antrea/openvswitch on the worker node.
- hostPath:
    path: /var/log/antrea
    type: DirectoryOrCreate
  name: host-var-log-antrea
...
containers:
- args:
  ...
  name: antrea-ovs
  ...
  volumeMounts:
  ...
  - mountPath: /var/log/openvswitch
    name: host-var-log-antrea
    subPath: openvswitch
  ...
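Because a subPath is used, kubelet does not mount the hostPath root directly into the container; it bind mounts the resolved host directory /var/log/antrea/openvswitch onto a per-pod subpath mount point. Conceptually (illustrative only; the real mount is set up by kubelet), each pod gets a mount equivalent to:
mount --bind /var/log/antrea/openvswitch /var/vcap/data/kubelet/pods/<POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1
This is why the same host directory shows up twice in the `mount` output above, once per pod GUID.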
During some events, such as a worker node restart or a containerd/kubelet restart, the antrea-agent pod is restarted and a new pod GUID is assigned. Once kubelet initialization finishes, a sync loop is started that handles both pod synchronization (creating the new antrea-agent pod) and cleanup of orphaned pods.
The problem is that pod synchronization runs about 2 seconds earlier than the cleanup. As a result, the new antrea-agent pod is created first, with a new volume and subpath mount point (the <NEW_POD_GUID> entry in the `mount` output above).
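To tell which of the two GUIDs belongs to the currently running antrea-agent pod, and therefore which one is orphaned, the pod UID can be checked; for example (pod name is illustrative, and antrea-agent typically runs in kube-system):
kubectl get pod -n kube-system <antrea-agent-pod-name> -o jsonpath='{.metadata.uid}'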
Once the new pod is running, the two processes ovs-vswitchd and ovsdb-server in its antrea-ovs container lock the log files under /var/log/openvswitch. According to the mapping above, those two log files also sit under /var/log/antrea/openvswitch on the worker node, the same host directory that is still bind mounted at the previous pod's subpath mount point.
When the cleanup handler in kubelet tries to remove the previous pod's volume, the remove operation fails with the error "device or resource busy", because the two log files in that directory are held by the new pod's container processes.
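To confirm that the orphaned mount point is kept busy by the new container's processes, a tool such as lsof can be run against it on the worker node (illustrative; assumes lsof is installed):
lsof +D /var/vcap/data/kubelet/pods/<OLD_POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1
The output is expected to list ovs-vswitchd and ovsdb-server holding the log files open.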
Although this problem stems from kubelet and antrea-agent behavior, an issue has been raised with Tanzu engineering for an improvement review. As a workaround, `bosh ssh` into the worker node and `umount` the orphaned volume.
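A minimal sketch of the workaround (the deployment and instance names are illustrative; replace <OLD_POD_GUID> with the orphaned pod GUID reported in kubelet.stderr.log):
bosh -d <deployment-name> ssh worker/<instance-id>
sudo -i
umount /var/vcap/data/kubelet/pods/<OLD_POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1
After the stale mount is removed, the next kubelet cleanup pass can remove the orphaned pod directory and the repeating error stops.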