Unable to create or upgrade tkgi clusters after upgrading to TKGI 1.8

Article ID: 345586


Updated On:

Products

VMware

Issue/Introduction

Symptoms:
  • You have enabled the Routable Pod Networks option by using a TKGI network profile.

  • You are unable to create or upgrade clusters after upgrading the TKGI API VM to 1.8.

  • The NCP service on the Master node crashes and keeps restarting.

  • Pods fail with the error message: 'netplugin failed with no error message'.

  • In /var/vcap/sys/log/ncp/ncp.stderr.log on the Master node, you see an IP block validation error traceback similar to:
    Traceback (most recent call last):
      File "/usr/local/bin/ncp", line 10, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/cmd/ncp.py", line 16, in main
        ncp_main.start_ncp(coe)
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/ncp/main.py", line 191, in start_ncp
        nsx_errors = common_utils.validate_nsx_config()
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 968, in validate_nsx_config
        ipnetwork_errors = _validate_mgr_ip_network()
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 758, in _validate_mgr_ip_network
        external_ip_space_ids)
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 905, in _validate_ip_network
        ip_family = owned_ip_blocks[0]['version']
    IndexError: list index out of range

  • In /var/vcap/sys/log/kubelet/kubelet.stderr.log on the Worker node, you see netplugin-related errors for pod creation:
    E0724 01:40:29.258490    9758 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6" network for pod "coredns-5b6649768f-wj6tl": networkPlugin cni failed to set up pod "coredns-5b6649768f-wj6tl_kube-system" network: netplugin failed with no error message
    W0724 01:40:29.280299    9758 docker_sandbox.go:394] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "coredns-5b6649768f-wj6tl_kube-system": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6"
    W0724 01:40:29.280803    9758 pod_container_deletor.go:75] Container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6" not found in pod's containers
    W0724 01:40:29.282232    9758 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6"



Environment

VMware PKS 1.x

Cause

NCP 3.0.1 added a new validation check for IP blocks, which returns the index error when all of the following conditions are met (see the simplified sketch after this list):

  1. In the NCP configuration, nsx_v3.enable_snat is false and nsx_v3.container_ip_blocks is empty, which is the case when you enable "Routable Pod Networks" by using a network profile.
  2. No IP block is shared by all clusters, i.e. no block carries the tag {'ncp/shared_resource': 'true'} in NSX-T Manager.
  3. No IP block carries the cluster tag {'ncp/cluster': <cluster_name>} in NSX-T Manager.
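The sketch below is a simplified illustration of the failing code path, based only on the traceback above and the conditions listed here; it is not NCP's actual source. With SNAT disabled and no container IP blocks configured, NCP builds its list of owned IP blocks solely from the NSX-T tags, so an empty result makes the first index lookup fail:

# Simplified illustration of the check in _validate_ip_network() that raises
# IndexError (based on the traceback above; not the actual NCP source).
def validate_ip_network(ip_blocks, cluster_name):
    # With enable_snat = False and container_ip_blocks empty, NCP only
    # considers blocks tagged in NSX-T as shared or owned by this cluster.
    owned_ip_blocks = [
        b for b in ip_blocks
        if {'scope': 'ncp/shared_resource', 'tag': 'true'} in b.get('tags', [])
        or {'scope': 'ncp/cluster', 'tag': cluster_name} in b.get('tags', [])
    ]
    # If no block matches either tag, the list is empty and the next line
    # raises "IndexError: list index out of range", crashing NCP on startup.
    ip_family = owned_ip_blocks[0]['version']
    return ip_family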

Resolution

This is a known issue with NCP 3.0.1 and will be resolved in NCP 3.0.2. Check the release notes of future TKGI versions for NCP 3.0.2 compatibility.


Workaround:

To work around this issue, create a new shared IP block in NSX-T Manager so that the NCP IP block validation passes.

  1. Create a dummy IP block whose CIDR does not overlap any cluster or other network, for example 127.0.0.0/30.

  2. Add a tag to this IP block with the scope ncp/shared_resource and the value true (see the API example after these steps).

  3. Restart NCP on all Master nodes of the existing clusters.

  4. Deploy the new clusters again, or upgrade the existing clusters by running "pks upgrade-cluster".
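For steps 1 and 2, the dummy IP block can be created in the NSX-T Manager UI or via the Manager API. The sketch below is one possible way to do it with Python's requests library against the /api/v1/pools/ip-blocks endpoint; the manager address, credentials, and block name are placeholders, and you should verify the call against the API guide for your NSX-T version.

# Sketch: create a dummy, non-overlapping IP block tagged as a shared NCP
# resource via the NSX-T Manager API (placeholder host/credentials; verify
# the endpoint against your NSX-T version's API documentation).
import requests

NSX_MANAGER = "https://nsx-manager.example.com"   # placeholder address
AUTH = ("admin", "password")                      # placeholder credentials

payload = {
    "display_name": "ncp-dummy-shared-block",     # any descriptive name
    "cidr": "127.0.0.0/30",                       # must not overlap any cluster or other network
    "tags": [
        {"scope": "ncp/shared_resource", "tag": "true"}
    ],
}

resp = requests.post(
    f"{NSX_MANAGER}/api/v1/pools/ip-blocks",
    json=payload,
    auth=AUTH,
    verify=False,   # only if the NSX manager uses a self-signed certificate
)
resp.raise_for_status()
print("Created IP block:", resp.json().get("id"))

After the block is created and tagged, restart NCP on the Master nodes (step 3) so that the new shared IP block is picked up by the validation check.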