Pods are failing to start and are stuck in the ContainerCreating state, with kubelet reporting:
"RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"693b941c5########################\": plugin type=\"nsx\" failed (add): Failed to receive message header"
containerd is also reporting that it cannot find the istio-cni plugin:
"Failed to destroy network for sandbox \"693b941c5######################\"" error="plugin type=\"istio-cni\" name=\"istio-cni\" failed (delete): failed to find plugin \"istio-cni\" in path [/var/vcap/jobs/kubelet/packages/cni/bin]"
The nsx-node-agent is reporting that it cannot plug the interface into the OVS bridge:
NSX 12003 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01005"] nsx_ujo.agent.cni_watcher_lin Unable to plug interface 693######## into OVS bridge for container 693b941c5########################: OvsdbAppException.__init__() takes 1 positional argument but 2 were given
Open vSwitch is unable to open the network device:
bridge|WARN|could not open network device 693########## (No such device)
TKGi (Tanzu Kubernetes Grid Integrated Edition) with NCP (NSX Container Plugin)
The root cause has not been identified; however, the issue was no longer present after istio-cni was removed.
Uninstall istio-cni, then recreate the affected worker nodes along with their persistent disks. The uninstallation of istio-cni itself is not covered in this KB.
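Before recreating the worker nodes, it can help to confirm that no istio-cni components remain on the cluster. The commands below are a minimal sketch: the DaemonSet name, its namespace, and the /etc/cni/net.d path are common defaults and may differ in your environment (on TKGi workers the CNI config may live under /var/vcap).
# Confirm no istio-cni DaemonSet or pods remain (names/namespaces are typical defaults)
kubectl get daemonset -A | grep istio-cni
kubectl get pods -A | grep istio-cni
# On a worker node, confirm the CNI config no longer chains istio-cni
# (standard path shown; on TKGi workers the config may be located under /var/vcap)
grep -R istio-cni /etc/cni/net.d/ 2>/dev/null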
Drain the worker nodes
kubectl get nodes
# --ignore-daemonsets and --delete-emptydir-data are typically needed when DaemonSet-managed or emptyDir-backed pods run on the node
kubectl drain <node name> --ignore-daemonsets --delete-emptydir-data
Recreate the Worker node with a new persistent disk
1. Preparation
# Resurrection: OFF
bosh update-resurrection off
bosh curl /resurrection
# Set the parameters
bosh vms
SERVICE_INSTANCE=service-instance_e1849014-e334-42b2-81c9-xxxxxxxxxxxx
# Target a worker node only; do not select a master node
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column=Index --column='Process State' --column='Disk CIDs' --column='VM CID'
VM_CID=vm-b4dea926-ec08-408d-a219-xxxxxxx
DISK_CID=disk-f96d5e9e-3572-47ef-8c40-xxxxxx
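If several workers are affected, the CIDs can also be pulled from JSON output instead of copied by hand. This is only a convenience sketch under assumptions: the instance name below is a placeholder, and the JSON field names (Tables, Rows, instance, vm_cid, disk_cids) reflect typical bosh CLI --json output and should be verified against your CLI version first.
# Optional: extract the CIDs via --json and jq (verify field names with `bosh is --details --json | jq .`)
INSTANCE=worker/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
VM_CID=$(bosh -d ${SERVICE_INSTANCE} is --details --json | jq -r --arg i "${INSTANCE}" '.Tables[0].Rows[] | select(.instance == $i) | .vm_cid')
DISK_CID=$(bosh -d ${SERVICE_INSTANCE} is --details --json | jq -r --arg i "${INSTANCE}" '.Tables[0].Rows[] | select(.instance == $i) | .disk_cids')
echo ${VM_CID} ${DISK_CID}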
2. Delete the target worker node
# Delete the Worker node
bosh -d ${SERVICE_INSTANCE} delete-vm ${VM_CID}
# Orphan (detach) the persistent disk
bosh -d ${SERVICE_INSTANCE} orphan-disk ${DISK_CID}
# Confirm the VM CID and Disk CID are now empty for the instance
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column=Index --column='Process State' --column='Disk CIDs' --column='VM CID'
3. Recreate the worker node with a new persistent disk
bosh -d ${SERVICE_INSTANCE} manifest > ${SERVICE_INSTANCE}.yaml
bosh -d ${SERVICE_INSTANCE} deploy ${SERVICE_INSTANCE}.yaml --fix --skip-drain
4. Verify the issue is no longer present and re-enable resurrection
# Confirm the recreated worker is running and pods start successfully
bosh -d ${SERVICE_INSTANCE} vms
kubectl get nodes
kubectl get pods -A -o wide
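To specifically confirm that no pods remain stuck in ContainerCreating, you can list pods whose phase is still Pending (pods in ContainerCreating report phase Pending); apart from pods that are legitimately waiting to be scheduled, this should return nothing.
# Pods stuck in ContainerCreating report phase=Pending
kubectl get pods -A --field-selector=status.phase=Pending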
# Resurrection: ON
bosh update-resurrection on
bosh curl /resurrection