You have enabled the Routable Pod Networks option by using TKGI network profiles.
You are unable to create or upgrade clusters after upgrading the TKGI API VM to version 1.8.
The NCP service on the Master node crashes and keeps restarting.
Pods fail with the error message: 'netplugin failed with no error message'
In /var/vcap/sys/log/ncp/ncp.stderr.log on the Master node, you see an IP block validation traceback similar to:
Traceback (most recent call last):
File "/usr/local/bin/ncp", line 10, in <module>
sys.exit(main())
File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/cmd/ncp.py", line 16, in main
ncp_main.start_ncp(coe)
File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/ncp/main.py", line 191, in start_ncp
nsx_errors = common_utils.validate_nsx_config()
File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 968, in validate_nsx_config
ipnetwork_errors = _validate_mgr_ip_network()
File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 758, in _validate_mgr_ip_network
external_ip_space_ids)
File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 905, in _validate_ip_network
ip_family = owned_ip_blocks[0]['version']
IndexError: list index out of range
In /var/vcap/sys/log/kubelet/kubelet.stderr.log on the Worker node, you see netplugin-related errors during pod creation:
E0724 01:40:29.258490 9758 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6" network for pod "coredns-5b6649768f-wj6tl": networkPlugin cni failed to set up pod "coredns-5b6649768f-wj6tl_kube-system" network: netplugin failed with no error message
W0724 01:40:29.280299 9758 docker_sandbox.go:394] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "coredns-5b6649768f-wj6tl_kube-system": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6"
W0724 01:40:29.280803 9758 pod_container_deletor.go:75] Container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6" not found in pod's containers
W0724 01:40:29.282232 9758 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6"
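To confirm that a cluster is hitting this issue rather than a different CNI failure, the two log signatures above can be matched with a short script. This is a minimal sketch, assuming the log lines have the same shape as the excerpts shown here; the script name check_ncp_symptoms.py is hypothetical, and the marker strings are copied from the excerpts rather than from any documented log format.

```python
import sys

# Marker strings copied from the NCP traceback and kubelet errors shown above.
NCP_MARKERS = ("_validate_ip_network", "IndexError: list index out of range")
KUBELET_MARKER = "netplugin failed with no error message"
POD_PREFIX = 'network for pod "'


def ncp_hit_ip_block_validation(log_text):
    """True if the NCP stderr log contains the IP block validation traceback."""
    return all(marker in log_text for marker in NCP_MARKERS)


def pods_with_netplugin_failure(log_text):
    """Pod names from kubelet lines that report the netplugin failure."""
    pods = []
    for line in log_text.splitlines():
        if KUBELET_MARKER not in line:
            continue
        # The pod name follows 'network for pod "' in the RunPodSandbox error line.
        start = line.find(POD_PREFIX)
        if start != -1:
            start += len(POD_PREFIX)
            pods.append(line[start:line.index('"', start)])
    return pods


# Hypothetical usage: python check_ncp_symptoms.py <ncp.stderr.log> <kubelet.stderr.log>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        print("NCP IP block validation failure:", ncp_hit_ip_block_validation(f.read()))
    with open(sys.argv[2], encoding="utf-8", errors="replace") as f:
        print("Pods affected:", sorted(set(pods_with_netplugin_failure(f.read()))))
```

Run it against copies of /var/vcap/sys/log/ncp/ncp.stderr.log from a Master node and /var/vcap/sys/log/kubelet/kubelet.stderr.log from a Worker node.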
NCP 3.0.1 added a new validation check for IP blocks. As the traceback shows, the check reads owned_ip_blocks[0], so it fails with an IndexError when that list is empty. This happens if all of the following conditions are met:
This is a known issue in NCP 3.0.1 and will be resolved in NCP 3.0.2. Check the TKGI release notes for NCP 3.0.2 compatibility in future TKGI versions.
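The failure mode in the traceback can be illustrated with a small model: a filtered list of IP blocks is indexed at [0] without first checking whether it is empty. This is an illustrative sketch only, not NCP's code. The filter below is a hypothetical stand-in that assumes a block qualifies when it carries the ncp/shared_resource=true tag (the tag the workaround uses); the real selection criteria in NCP 3.0.1 may differ. The IP family is derived here from the CIDR with Python's ipaddress module, whereas NCP reads a 'version' field.

```python
import ipaddress

# Hypothetical stand-in for NCP's selection criteria, based on the workaround tag.
SHARED_TAG = {"scope": "ncp/shared_resource", "tag": "true"}


def owned_ip_blocks(ip_blocks):
    """Hypothetical filter: keep only blocks tagged ncp/shared_resource=true."""
    return [block for block in ip_blocks if SHARED_TAG in block.get("tags", [])]


def ip_family(ip_blocks):
    """Mimics the failing line: index [0] without checking for an empty list."""
    blocks = owned_ip_blocks(ip_blocks)
    # Raises IndexError when no block qualifies, as in the NCP traceback.
    return ipaddress.ip_network(blocks[0]["cidr"]).version


untagged = [{"cidr": "10.200.0.0/16", "tags": []}]
dummy = {"cidr": "127.0.0.0/30", "tags": [dict(SHARED_TAG)]}

try:
    ip_family(untagged)
except IndexError as exc:
    print("validation fails:", exc)  # validation fails: list index out of range

print("with dummy block:", ip_family(untagged + [dummy]))  # with dummy block: 4
```

Under this model, adding one tagged dummy block makes the list non-empty, so the [0] lookup no longer fails. That is the effect the workaround relies on.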
To work around this issue, create a new shared IP block in NSX-T Manager so that NCP's IP block validation passes.
Create a dummy IP block whose CIDR does not overlap with any cluster or other network, for example 127.0.0.0/30.
Add a tag on this IP block with the scope ncp/shared_resource and the value true.
Restart NCP on all Master nodes for existing clusters.
Create the new clusters again, or upgrade the existing clusters by running "pks upgrade-cluster".
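If you create the dummy block through the NSX-T Manager API rather than the UI, you can check the non-overlap requirement and build the request body in a script. This is a minimal sketch that assumes the Manager API IpBlock resource (POST /api/v1/pools/ip-blocks) accepts display_name, cidr, and tags entries of the form {"scope": ..., "tag": ...}. Verify this against the NSX-T API guide for your version. The display name and the in-use ranges are hypothetical placeholders, and the script only prints the body without sending it.

```python
import ipaddress
import json

# Tag required by the workaround: scope ncp/shared_resource, value true.
SHARED_TAG = {"scope": "ncp/shared_resource", "tag": "true"}


def build_dummy_ip_block(cidr, existing_networks, display_name="ncp-dummy-shared-block"):
    """Return an IP block body for the dummy block, refusing overlapping CIDRs.

    display_name is a hypothetical placeholder; use any name that suits your environment.
    """
    dummy = ipaddress.ip_network(cidr)
    for net in existing_networks:
        other = ipaddress.ip_network(net)
        # Only compare networks of the same IP version.
        if dummy.version == other.version and dummy.overlaps(other):
            raise ValueError(f"{cidr} overlaps existing network {net}")
    return {"display_name": display_name, "cidr": str(dummy), "tags": [dict(SHARED_TAG)]}


# Example: hypothetical pod, node, and floating IP ranges already in use.
in_use = ["10.200.0.0/16", "172.15.0.0/16", "192.168.160.0/24"]
body = build_dummy_ip_block("127.0.0.0/30", in_use)
print(json.dumps(body, indent=2))
```

The overlap check runs before anything is created, so a CIDR that collides with an existing network is rejected locally instead of after the block has been added in NSX-T.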