Antrea-agent daemonset could cause orphaned volume mount point due to "device or resource busy"



Article ID: 403469



Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Kubelet fails to clean up a volume from an orphaned pod and logs the error below in kubelet.stderr.log. The message repeats every 2 seconds. <POD_GUID> refers to a previous pod of the antrea-agent daemonset on the worker node.

E0623 15:31:23.413146 1513010 kubelet_volumes.go:263] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"<POD_GUID>\" found, but failed to rmdir() subpath at path /var/vcap/data/kubelet/pods/<POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1: remove /var/vcap/data/kubelet/pods/<POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1: device or resource busy" numErrs=4
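To see which pod directory is affected, the pod GUID and the subpath can be pulled out of the error line. The following is a minimal Python sketch, assuming the message format of the single sample above. `parse_orphan_error` is a hypothetical helper name, and the GUID `1111-aaaa` is a made-up placeholder.

```python
import re

# Matches the "failed to rmdir() subpath at path <path>:" fragment of the
# kubelet message. The pattern is an assumption derived from one sample
# line, not from a documented log format.
_ORPHAN_RE = re.compile(
    r'orphaned pod \\?"(?P<pod>[^"\\]+)\\?" found, '
    r'but failed to rmdir\(\) subpath at path (?P<path>\S+?):'
)

def parse_orphan_error(line):
    """Return (pod_guid, subpath) from an orphaned-pod error line, or None."""
    m = _ORPHAN_RE.search(line)
    if m is None:
        return None
    return m.group("pod"), m.group("path")

# Sample line with a placeholder GUID (hypothetical value).
sample = (
    'E0623 15:31:23.413146 1513010 kubelet_volumes.go:263] '
    '"There were many similar errors. Turn up verbosity to see them." '
    'err="orphaned pod \\"1111-aaaa\\" found, but failed to rmdir() subpath at path '
    '/var/vcap/data/kubelet/pods/1111-aaaa/volume-subpaths/host-var-log-antrea/antrea-ovs/1: '
    'remove /var/vcap/data/kubelet/pods/1111-aaaa/volume-subpaths/host-var-log-antrea/antrea-ovs/1: '
    'device or resource busy" numErrs=4'
)
print(parse_orphan_error(sample))
```

The returned path is the mount point that kubelet could not remove.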

`mount` output on the worker node shows two antrea-agent pod volumes: 

/dev/sdb1 on /var/vcap/data/kubelet/pods/<OLD_POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)
/dev/sdb1 on /var/vcap/data/kubelet/pods/<NEW_POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)
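A quick way to spot this condition is to look for the same subpath mounted under more than one pod directory. The sketch below is illustrative only: it assumes the `mount` line format shown above, the function names are hypothetical, and `old-guid` / `new-guid` are placeholders.

```python
import re
from collections import defaultdict

# Captures the pod GUID and the subpath suffix of a kubelet subpath mount.
# The /var/vcap/data/kubelet prefix is taken from the output above.
_MOUNT_RE = re.compile(
    r"^\S+ on /var/vcap/data/kubelet/pods/([^/]+)/volume-subpaths/(\S+) type "
)

def group_subpath_mounts(mount_output):
    """Map each subpath suffix to the pod GUIDs that have it mounted."""
    groups = defaultdict(list)
    for line in mount_output.splitlines():
        m = _MOUNT_RE.match(line.strip())
        if m:
            groups[m.group(2)].append(m.group(1))
    return dict(groups)

def duplicated_subpaths(mount_output):
    """Keep only the subpaths that are mounted for more than one pod."""
    groups = group_subpath_mounts(mount_output)
    return {path: pods for path, pods in groups.items() if len(pods) > 1}

# Placeholder input modelled on the two lines above plus an unrelated mount.
sample = """\
/dev/sdb1 on /var/vcap/data/kubelet/pods/old-guid/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)
/dev/sdb1 on /var/vcap/data/kubelet/pods/new-guid/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)
/dev/sda1 on / type ext4 (rw,relatime)
"""
print(duplicated_subpaths(sample))
```

This only reports duplicates. Deciding which GUID is the stale one still requires comparing against the pods currently running on the node.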

Environment

  • VMware Tanzu Kubernetes Grid Integrated Edition

Cause

The antrea-agent pod has the hostPath mapping shown below. It maps the directory /var/log/openvswitch in the antrea-ovs container to the directory /var/log/antrea/openvswitch on the worker node.

    - hostPath:
        path: /var/log/antrea
        type: DirectoryOrCreate
      name: host-var-log-antrea
    ...
    containers:
    - args:
    ...
      name: antrea-ovs
    ...
      volumeMounts:
    ...
      - mountPath: /var/log/openvswitch
        name: host-var-log-antrea
        subPath: openvswitch
    ...
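The host directory behind this mount is the hostPath joined with the subPath. A small Python sketch of that resolution, where `host_dir_for_mount` is a hypothetical helper used only for illustration:

```python
import posixpath

def host_dir_for_mount(host_path, sub_path):
    """Host directory backing a volumeMount that combines hostPath and subPath."""
    return posixpath.normpath(posixpath.join(host_path, sub_path))

# Values taken from the manifest fragment above.
print(host_dir_for_mount("/var/log/antrea", "openvswitch"))
# prints /var/log/antrea/openvswitch
```

Every antrea-agent pod on the node resolves to this same host directory, regardless of its pod GUID.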

During events such as a worker node restart or a containerd/kubelet restart, the antrea-agent pod is restarted and assigned a new pod GUID. Once kubelet initialization finishes, a sync loop starts to handle:

  • synchronizing pod events, for example, daemonset pod creation
  • cleaning up orphaned resources, such as volumes that belonged to pods from before the restart
  • other events

The problem is that pod synchronization runs 2 seconds before cleanup. As a result, the new antrea-agent pod is created with a new volume and mount point:

  • /dev/sdb1 on /var/vcap/data/kubelet/pods/<NEW_POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)

The two processes ovs-vswitchd and ovsdb-server in the antrea-ovs container then lock the log files under that directory. Because of the hostPath mapping above, the same two log files are also visible under:

  • /var/log/antrea/openvswitch
  • /dev/sdb1 on /var/vcap/data/kubelet/pods/<OLD_POD_GUID>/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)

When the cleanup handler in kubelet tries to remove the previous pod's volume, the two log files in the directory are already locked by processes of the new pod's container, so the remove operation fails with the error "device or resource busy".
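The ordering dependency described above can be illustrated with a toy model. This is not kubelet code: `simulate_cleanup` and its arguments are hypothetical, and the model only encodes the article's explanation that cleanup fails once the new pod's processes hold the shared log files.

```python
def simulate_cleanup(sync_at, cleanup_at):
    """Toy model: return True if removal of the old subpath mount succeeds.

    sync_at    -- time (seconds) at which the new antrea-agent pod is started
    cleanup_at -- time (seconds) at which orphaned-volume cleanup runs
    """
    log_files_in_use = False    # are the OVS log files held by the new pod?
    old_mount_removed = False
    for _, event in sorted([(sync_at, "sync"), (cleanup_at, "cleanup")]):
        if event == "sync":
            # The new pod starts ovs-vswitchd/ovsdb-server, which open the logs.
            log_files_in_use = True
        else:
            # Removal only succeeds while nothing holds the shared directory.
            old_mount_removed = not log_files_in_use
    return old_mount_removed

print(simulate_cleanup(sync_at=0, cleanup_at=2))  # sync first: prints False
print(simulate_cleanup(sync_at=2, cleanup_at=0))  # cleanup first: prints True
```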

Resolution

Although this problem stems from kubelet and antrea-agent behavior, an issue has been raised with Tanzu engineering for review. As a workaround, `bosh ssh` into the worker node and `umount` the orphaned volume mount point.
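To avoid unmounting the live pod's volume by mistake, the candidate `umount` targets can be derived from the `mount` output first. The sketch below only builds the command strings and does not run them. It assumes the mount line format shown in this article and a caller-supplied set of GUIDs for pods currently running on the node. `orphan_umount_commands` and the sample GUIDs are hypothetical.

```python
import re
import shlex

# Captures the full mount point and the pod GUID of a kubelet subpath mount.
_POD_MOUNT_RE = re.compile(
    r"^\S+ on (/var/vcap/data/kubelet/pods/([^/]+)/volume-subpaths/\S+) type "
)

def orphan_umount_commands(mount_output, active_pod_guids):
    """Build umount commands for subpath mounts whose pod is no longer active."""
    commands = []
    for line in mount_output.splitlines():
        m = _POD_MOUNT_RE.match(line.strip())
        if m and m.group(2) not in active_pod_guids:
            commands.append("umount " + shlex.quote(m.group(1)))
    return commands

# Placeholder mount output; "new-guid" stands for the running pod.
sample = """\
/dev/sdb1 on /var/vcap/data/kubelet/pods/old-guid/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)
/dev/sdb1 on /var/vcap/data/kubelet/pods/new-guid/volume-subpaths/host-var-log-antrea/antrea-ovs/1 type ext4 (rw,relatime)
"""
for cmd in orphan_umount_commands(sample, {"new-guid"}):
    print(cmd)
```

Review the printed commands before running them as root on the worker node.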