Pods are failing to start and are stuck in the ContainerCreating state, with kubelet reporting:
"RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"693b941c5########################\": plugin type=\"nsx\" failed (add): Failed to receive message header"
containerd is also reporting that it cannot find the istio-cni plugin:
"Failed to destroy network for sandbox \"693b941c5######################\"" error="plugin type=\"istio-cni\" name=\"istio-cni\" failed (delete): failed to find plugin \"istio-cni\" in path [/var/vcap/jobs/kubelet/packages/cni/bin]"
The nsx-node-agent is reporting that it cannot plug the interface into the OVS bridge:
NSX 12003 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01005"] nsx_ujo.agent.cni_watcher_lin Unable to plug interface 693######## into OVS bridge for container 693b941c5########################: OvsdbAppException.__init__() takes 1 positional argument but 2 were given
Open vSwitch is unable to open the network device:
bridge|WARN|could not open network device 693########## (No such device)
TKGi (Tanzu Kubernetes Grid Integrated Edition) with NCP (NSX Container Plugin)
The root cause has not been identified; however, the issue was no longer present after istio-cni was removed.
Uninstall istio-cni, then recreate the affected worker nodes along with their persistent disks. The uninstallation of istio-cni itself is not covered in this KB.
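Before recreating the worker nodes, it can help to confirm that no istio-cni components remain on the cluster. The commands below are a minimal sketch: the DaemonSet name, its namespace, and the /etc/cni/net.d path are common defaults and may differ in your environment (on TKGi workers the CNI config may live under /var/vcap).
# Confirm no istio-cni DaemonSet or pods remain (names/namespaces are typical defaults)
kubectl get daemonset -A | grep istio-cni
kubectl get pods -A | grep istio-cni
# On a worker node, confirm the CNI config no longer chains istio-cni
# (standard path shown; on TKGi workers the config may be located under /var/vcap)
grep -R istio-cni /etc/cni/net.d/ 2>/dev/null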
Drain the worker nodes
kubectl get nodes
# --ignore-daemonsets and --delete-emptydir-data are typically needed when DaemonSet-managed or emptyDir-backed pods run on the node
kubectl drain <node name> --ignore-daemonsets --delete-emptydir-data
Recreate the Worker node with a new persistent disk
1. Preparation
# Resurrection: OFF
bosh update-resurrection off
bosh curl /resurrection
# Set the parameters
bosh vms
SERVICE_INSTANCE=service-instance_e1849014-e334-42b2-81c9-xxxxxxxxxxxx
# Target a worker node only; do not select a master node
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column=Index --column='Process State' --column='Disk CIDs' --column='VM CID'
VM_CID=vm-b4dea926-ec08-408d-a219-xxxxxxx
DISK_CID=disk-f96d5e9e-3572-47ef-8c40-xxxxxx
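If several workers are affected, the CIDs can also be pulled from JSON output instead of copied by hand. This is only a convenience sketch under assumptions: the instance name below is a placeholder, and the JSON field names (Tables, Rows, instance, vm_cid, disk_cids) reflect typical bosh CLI --json output and should be verified against your CLI version first.
# Optional: extract the CIDs via --json and jq (verify field names with `bosh is --details --json | jq .`)
INSTANCE=worker/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
VM_CID=$(bosh -d ${SERVICE_INSTANCE} is --details --json | jq -r --arg i "${INSTANCE}" '.Tables[0].Rows[] | select(.instance == $i) | .vm_cid')
DISK_CID=$(bosh -d ${SERVICE_INSTANCE} is --details --json | jq -r --arg i "${INSTANCE}" '.Tables[0].Rows[] | select(.instance == $i) | .disk_cids')
echo ${VM_CID} ${DISK_CID}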
2. Delete the target worker node
# Delete the Worker node
bosh -d ${SERVICE_INSTANCE} delete-vm ${VM_CID}
# Orphan (detach) the persistent disk
bosh -d ${SERVICE_INSTANCE} orphan-disk ${DISK_CID}
# Confirm the VM CID and Disk CID are now empty for the instance
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column=Index --column='Process State' --column='Disk CIDs' --column='VM CID'
3. Recreate the worker node with a new persistent disk
bosh -d ${SERVICE_INSTANCE} manifest > ${SERVICE_INSTANCE}.yaml
bosh -d ${SERVICE_INSTANCE} deploy ${SERVICE_INSTANCE}.yaml --fix --skip-drain
4. Verify the issue is no longer present and re-enable resurrection
# Confirm the recreated worker is running and pods start successfully
bosh -d ${SERVICE_INSTANCE} vms
kubectl get nodes
kubectl get pods -A -o wide
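To specifically confirm that no pods remain stuck in ContainerCreating, you can list pods whose phase is still Pending (pods in ContainerCreating report phase Pending); apart from pods that are legitimately waiting to be scheduled, this should return nothing.
# Pods stuck in ContainerCreating report phase=Pending
kubectl get pods -A --field-selector=status.phase=Pending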
# Resurrection: ON
bosh update-resurrection on
bosh curl /resurrection