Orphaned Pod messages in log files

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The following error message, reporting an orphaned pod, appears repeatedly in kubelet.stderr.log:

2023-01-23T12:47:00.659126Z 1fcc47d5-0087-48dc-934a-5de333a4e1a9.worker.pks-########-####-40a2-ba95-6d1b56e706c1.service-instance-########-####-40a2-ba95-6d1b56e706c1.bosh 
kubelet rs2 - [instance@47450 director="" deployment="service-instance_########-####-40a2-ba95-6d1b56e706c1" 
group="worker" 
az="az3" 
id="########-####-48dc-934a-5de333a4e1a9"] 
E0123 12:47:00.565283 9708 kubelet_volumes.go:245] 
"There were many similar errors. Turn up verbosity to see them." 
err="orphaned pod \"0d013b05-7fcc-4dfb-9b52-b02ae4677b77\" 
found, but error not a directory occurred when trying to remove the volumes dir" numErrs=1
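To see which pods the kubelet is complaining about, the pod UIDs can be pulled out of the log. The following is a minimal sketch, assuming each log entry sits on one line in the file and that the UID is wrapped in quotes (escaped or not) as in the message above. The function name `orphaned_pod_uids` is illustrative, not part of any product tooling.

```python
import re

# Matches the pod UID in the kubelet "orphaned pod" error, with or without
# a backslash in front of the quotes that surround the UID.
ORPHAN_RE = re.compile(r'orphaned pod \\?"([0-9a-fA-F-]{36})\\?"')


def orphaned_pod_uids(lines):
    """Return the distinct pod UIDs named in orphaned-pod errors, in first-seen order."""
    seen = []
    for line in lines:
        for uid in ORPHAN_RE.findall(line):
            if uid not in seen:
                seen.append(uid)
    return seen
```

The function accepts any iterable of strings, so an open file handle for kubelet.stderr.log can be passed to it directly.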

Environment

Product Version: 1.14

Resolution

Cause:

The exact cause of this issue is not known. Several Kubernetes users have reported similar issues in the upstream Kubernetes project, and a common scenario was an unexpected reboot of a worker node that was using a volume:

Kubernetes issue 60987

Kubernetes issue 105536

You can check the worker node kernel log, /var/log/kern.log, for kernel boot events:

2023-01-19T07:35:23.135838+00:00 8c06d8c8-338a-46ba-9326-4dce52e6c06d kernel: [    0.000000] Linux version 4.15.0-200-generic (buildd@lcy02-amd64-022) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)) #211~16.04.2-Ubuntu SMP Fri Nov 25 09:18:48 UTC 2022 (Ubuntu 4.15.0-200.211~16.04.2-generic 4.15.18)
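The kernel prints a "Linux version" banner at boot, so listing those lines gives the boot times recorded in the log. The following is a minimal sketch, assuming each kern.log line starts with an ISO 8601 timestamp as in the line above and that the banner appears once per boot. The name `boot_times` is illustrative.

```python
from datetime import datetime

# Text that follows the kernel timestamp on the boot banner line.
BOOT_MARKER = "] Linux version "


def boot_times(lines):
    """Return the timestamps of kernel boot banners found in kern.log lines."""
    times = []
    for line in lines:
        if "kernel:" in line and BOOT_MARKER in line:
            # The first whitespace-separated field is the log timestamp.
            stamp = line.split(" ", 1)[0]
            times.append(datetime.fromisoformat(stamp))
    return times
```

Passing the lines of /var/log/kern.log to `boot_times` returns one timezone-aware `datetime` per boot banner found.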

Then check when the pod was created and deleted, and whether the reboot occurred while the pod was running on the worker, to see if an unexpected reboot may have caused the issue:

grep da66318c-ad5a-48d8-a9ab-fda0cf00d3cb ./nsx-node-agent/nsx-node-agent.stdout.log
 
2023-01-18T15:15:13.368Z 8c06d8c8-338a-46ba-9326-4dce52e6c06d NSX 10833 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.hyperbus_service Updated app nsx.master-########-####-48d8-a9ab-fda0cf00d3cb% with IP 10.###.###.5/28, MAC 04:50:56:##:##:##, vlan 8,gateway 10.###.###.1/28, CIF 6f1f9f9f-6310-4da9-####-############, wait_for_sync False
 
../..
 
2023-01-19T07:43:36.491Z 8c06d8c8-338a-46ba-9326-4dce52e6c06d NSX 7081 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.hyperbus_service Deleted app nsx.master-########-####-48d8-a9ab-fda0cf00d3cb% from cache
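The comparison can be sketched in code. This is a minimal illustration, assuming the "Updated app" timestamp is taken as the pod creation time and the "Deleted app" timestamp as the deletion time, and that all timestamps are ISO 8601 strings as in the logs above. The names `parse_ts` and `reboot_during_pod_lifetime` are hypothetical.

```python
from datetime import datetime


def parse_ts(stamp):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return datetime.fromisoformat(stamp)


def reboot_during_pod_lifetime(created, deleted, boots):
    """Return the boot timestamps that fall between pod creation and deletion."""
    start, end = parse_ts(created), parse_ts(deleted)
    return [b for b in boots if start <= parse_ts(b) <= end]
```

With the timestamps shown above (pod updated at 2023-01-18T15:15:13.368Z, deleted at 2023-01-19T07:43:36.491Z, kernel boot at 2023-01-19T07:35:23.135838+00:00), the boot falls inside the pod's lifetime, which is consistent with an unexpected reboot while the pod was running.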

Resolution:

To remove the errors from the log, recreate the affected worker node. As of today, there is no Kubernetes upstream fix available.