PKS Flannel network gets out of sync with docker bridge network (cni0)

Article ID: 298578


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Symptoms:

Upon manual restart of the VMs (via vCenter), pod rollouts become stuck in the ContainerCreating status.

Possibility 1: Run kubectl describe pod <pod_name>

The error will look similar to this:

Warning  FailedCreatePodSandBox  1m (x12 over 1m)  kubelet, ########-########-##########-########  Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "nginx-c58f88dd6-hqszg_default" network: "cni0" already has an IP address different from 10.200.##.#/24

Possibility 2: Running ifconfig on a worker node shows that flannel.1 and cni0 are on different subnets:

worker/xxxx:/var/lib/cni/networks# ifconfig
cni0      Link encap:Ethernet  HWaddr 4e:d5:14:c4:c8:f3
          inet addr:10.200.##.#  Bcast:10.200.##.###  Mask:255.255.255.0

flannel.1 Link encap:Ethernet  HWaddr 2e:e3:38:5b:d5:69
          inet addr:10.200.##.#  Bcast:0.0.0.0  Mask:255.255.255.255
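The check above can be sketched as a quick shell comparison of the /24 prefixes of the two interface addresses. This is a minimal sketch, not part of the product; the sample addresses below are placeholders, so substitute the values from your own ifconfig output:

```shell
# Compare the first three octets (/24 prefix) of two dotted-quad IPv4 addresses.
same_24() {
  [ "${1%.*}" = "${2%.*}" ]
}

cni0_addr="10.200.45.1"      # placeholder: inet addr of cni0
flannel_addr="10.200.63.0"   # placeholder: inet addr of flannel.1

if same_24 "$cni0_addr" "$flannel_addr"; then
  echo "cni0 and flannel.1 are in sync"
else
  echo "cni0 and flannel.1 are OUT OF SYNC"
fi
```

If the two /24 prefixes differ, the worker is affected by this issue.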

Environment


Cause

Flannel assigns a subnet lease to each worker in the cluster, which expires after 24 hours (this is not configurable in flannel).
Upon restart of the VMs, the flannel.1 and cni0 /24 subnets no longer match, which causes this issue.

Resolution

This issue is fixed in TKGI 1.9.3; earlier TKGI 1.9 releases are affected.

Here are the workaround steps:

Resolution 1:

  1. bosh ssh -d <deployment_name> worker -c "sudo /var/vcap/bosh/bin/monit stop flanneld"
  2. bosh ssh -d <deployment_name> worker -c "sudo rm /var/vcap/store/docker/docker/network/files/local-kv.db"
  3. bosh ssh -d <deployment_name> worker -c "sudo /var/vcap/bosh/bin/monit restart all"
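The three steps above can be combined into one sketch. This assumes the bosh CLI is installed and targeting your director; the BOSH variable and the function name are assumptions added here so the commands can be dry-run with BOSH=echo before touching a real deployment:

```shell
# Sketch of Resolution 1 as a single function.
# Set BOSH=echo to print the commands instead of executing them.
BOSH="${BOSH:-bosh}"

fix_flannel_workers() {
  local deployment="$1"
  # Stop flanneld on every worker, remove Docker's stale local key-value
  # store, then restart all jobs so flannel and cni0 re-sync.
  $BOSH ssh -d "$deployment" worker -c "sudo /var/vcap/bosh/bin/monit stop flanneld"
  $BOSH ssh -d "$deployment" worker -c "sudo rm /var/vcap/store/docker/docker/network/files/local-kv.db"
  $BOSH ssh -d "$deployment" worker -c "sudo /var/vcap/bosh/bin/monit restart all"
}
```

Example dry run: `BOSH=echo fix_flannel_workers my-deployment`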


Resolution 2:

Note: Only to be used if Resolution 1 does not work.

  1. bosh ssh -d <deployment_name> worker -c "sudo /var/vcap/bosh/bin/monit stop flanneld"
  2. bosh ssh -d <deployment_name> worker -c "ifconfig | grep -A 1 flannel"
  3. On a master node, get access to etcd using the following KB 
  4. On the master node, run `etcdctlv2 ls /coreos.com/network/subnets/`
  5. Remove all the worker subnet leases from etcd by running `etcdctlv2 rm /coreos.com/network/subnets/<worker_subnet>` for each of the worker subnets found in step 2 above.
  6. bosh ssh -d <deployment_name> worker -c "sudo /var/vcap/bosh/bin/monit restart flanneld"
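Steps 4 and 5 can be sketched as a loop. This assumes etcdctlv2 access has been set up per step 3; the ETCDCTL variable and the function name are assumptions added here so the commands can be dry-run with ETCDCTL=echo, and the subnet key format (e.g. 10.200.63.0-24) should be verified against the output of the ls command in step 4:

```shell
# Sketch: remove each worker subnet lease listed under
# /coreos.com/network/subnets/ in etcd.
# Set ETCDCTL=echo to print the commands instead of executing them.
ETCDCTL="${ETCDCTL:-etcdctlv2}"

remove_worker_leases() {
  # $@ -- the worker subnet keys gathered in steps 2 and 4, e.g. 10.200.63.0-24
  local subnet
  for subnet in "$@"; do
    $ETCDCTL rm "/coreos.com/network/subnets/$subnet"
  done
}
```

Example dry run: `ETCDCTL=echo remove_worker_leases 10.200.63.0-24 10.200.45.0-24`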