Unable to create or upgrade tkgi clusters after upgrading to TKGI 1.8

Article ID: 345586

Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Symptoms:
  • You have enabled the Routable Pod Networks option by using the TKGI Network profiles.

  • You are unable to create or upgrade clusters after upgrading the TKGI API VM to 1.8.

  • The NCP service on the Master node crashes and keeps restarting.

  • Pods fail with the error message: 'netplugin failed with no error message'

  • In /var/vcap/sys/log/ncp/ncp.stderr.log on the Master node, you see an IP block validation traceback similar to:
    Traceback (most recent call last):
      File "/usr/local/bin/ncp", line 10, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/cmd/ncp.py", line 16, in main
        ncp_main.start_ncp(coe)
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/ncp/main.py", line 191, in start_ncp
        nsx_errors = common_utils.validate_nsx_config()
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 968, in validate_nsx_config
        ipnetwork_errors = _validate_mgr_ip_network()
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 758, in _validate_mgr_ip_network
        external_ip_space_ids)
      File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", line 905, in _validate_ip_network
        ip_family = owned_ip_blocks[0]['version']
    IndexError: list index out of range

  • In /var/vcap/sys/log/kubelet/kubelet.stderr.log on the Worker node, you see netplugin-related errors for pod creation:
    E0724 01:40:29.258490    9758 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6" network for pod "coredns-5b6649768f-wj6tl": networkPlugin cni failed to set up pod "coredns-5b6649768f-wj6tl_kube-system" network: netplugin failed with no error message
    W0724 01:40:29.280299    9758 docker_sandbox.go:394] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "coredns-5b6649768f-wj6tl_kube-system": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6"
    W0724 01:40:29.280803    9758 pod_container_deletor.go:75] Container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6" not found in pod's containers
    W0724 01:40:29.282232    9758 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1591f61d797a3def811a62d6ea93a89d1c891e80960330f7ab46ab7fa93eecd6"
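
To confirm that a node shows the symptoms above, the two log files can be searched for the quoted error strings. The following is a minimal sketch only: the function name and the returned dictionary keys are hypothetical, and it assumes the log contents have already been read into strings (for example, from the paths listed above).

```python
# Hypothetical helper; the names below are illustrative and not part of TKGI or NCP.
NCP_SIGNATURES = (
    "_validate_ip_network",
    "IndexError: list index out of range",
)
KUBELET_SIGNATURE = "netplugin failed with no error message"


def matches_known_issue(ncp_log: str, kubelet_log: str) -> dict:
    """Report which of the symptoms quoted in this article appear in the log text."""
    return {
        # Both strings from the NCP traceback must be present.
        "ncp_ip_block_validation": all(s in ncp_log for s in NCP_SIGNATURES),
        # The kubelet error is a single fixed phrase.
        "kubelet_netplugin": KUBELET_SIGNATURE in kubelet_log,
    }


if __name__ == "__main__":
    ncp_sample = (
        'File "/usr/local/lib/python3.5/dist-packages/nsx_ujo/common/utils.py", '
        "line 905, in _validate_ip_network\n"
        "    ip_family = owned_ip_blocks[0]['version']\n"
        "IndexError: list index out of range\n"
    )
    kubelet_sample = (
        'networkPlugin cni failed to set up pod "coredns-5b6649768f-wj6tl_kube-system" '
        "network: netplugin failed with no error message\n"
    )
    print(matches_known_issue(ncp_sample, kubelet_sample))
```

A match on both keys suggests, but does not prove, that the node is affected by this issue; the conditions under Cause should still be checked.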



Environment

VMware PKS 1.x

Cause

NCP 3.0.1 added a new validation check for IP blocks. The check raises the IndexError shown above when all of the following conditions are met:

  1. In the NCP configuration, nsx_v3.enable_snat is false and nsx_v3.container_ip_blocks is empty. This is the case when you enable "Routable Pod Networks" by using a network profile.
  2. No IP block is shared by all clusters, that is, no block in NSX-T Manager has the tag {'ncp/shared_resource': 'true'}.
  3. No IP block in NSX-T Manager has the cluster tag {'ncp/cluster': <cluster_name>}.
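
The failure mode can be illustrated with a small sketch. This is not the NCP source code: the function names, the dictionary layout of a block (including the "scope" and "tag" keys), and the cluster name are assumptions made for illustration. Only the indexing expression is taken from the traceback.

```python
def has_tag(block, scope, value):
    """True if the block carries a tag with this scope and value."""
    return any(
        t.get("scope") == scope and t.get("tag") == value
        for t in block.get("tags", [])
    )


def owned_ip_blocks_for(all_blocks, cluster_name):
    """Blocks shared by all clusters or tagged for this cluster (conditions 2 and 3)."""
    return [
        b for b in all_blocks
        if has_tag(b, "ncp/shared_resource", "true")
        or has_tag(b, "ncp/cluster", cluster_name)
    ]


def ip_family(owned_ip_blocks):
    # Same shape as the failing line in the traceback: no emptiness check.
    return owned_ip_blocks[0]["version"]


if __name__ == "__main__":
    cluster = "example-cluster"  # hypothetical cluster name
    blocks = []                  # no shared or cluster-tagged block exists
    try:
        ip_family(owned_ip_blocks_for(blocks, cluster))
    except IndexError as exc:
        print("validation crashed:", exc)

    # With one block tagged as a shared resource, the lookup no longer fails.
    blocks.append({
        "cidr": "127.0.0.0/30",
        "version": "placeholder",  # placeholder; real field values are not shown in this article
        "tags": [{"scope": "ncp/shared_resource", "tag": "true"}],
    })
    print("ip family:", ip_family(owned_ip_blocks_for(blocks, cluster)))
```

When the filtered list is empty, indexing element 0 raises IndexError, which matches the last line of the traceback.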

Resolution

This is a known issue in NCP 3.0.1 and will be resolved in NCP 3.0.2. Check the TKGI release notes to see which future TKGI versions are compatible with NCP 3.0.2.


Workaround:

To work around this issue, create a new shared IP block in NSX-T Manager so that the NCP IP block validation passes.

  1. Create a dummy IP block whose CIDR does not overlap with any cluster network or other network, for example 127.0.0.0/30.

  2. Add a tag on this IP block with the scope ncp/shared_resource and the value true.

  3. Restart NCP on all Master nodes of the existing clusters.

  4. Deploy the new clusters again, or upgrade the existing clusters by running "pks upgrade-cluster".
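
Before creating the dummy block in step 1, the candidate CIDR can be checked against the networks already in use. The sketch below uses Python's standard ipaddress module. The function names, the list of existing CIDRs, and the display name are hypothetical, and the dictionary returned by describe_dummy_block is only an assumed way to write down the block and its tag from step 2, not a confirmed NSX-T request format.

```python
import ipaddress


def overlapping_networks(candidate_cidr, existing_cidrs):
    """Return the existing CIDRs that overlap the candidate CIDR."""
    candidate = ipaddress.ip_network(candidate_cidr)
    return [
        cidr for cidr in existing_cidrs
        if candidate.overlaps(ipaddress.ip_network(cidr))
    ]


def describe_dummy_block(cidr, display_name="dummy-shared-ip-block"):
    """Assumed description of the block; display_name is a hypothetical value."""
    return {
        "display_name": display_name,
        "cidr": cidr,
        # Scope and value come from step 2 of the workaround.
        "tags": [{"scope": "ncp/shared_resource", "tag": "true"}],
    }


if __name__ == "__main__":
    # Hypothetical networks already used in the environment.
    in_use = ["10.200.0.0/16", "172.16.0.0/16", "192.168.10.0/24"]
    candidate = "127.0.0.0/30"

    conflicts = overlapping_networks(candidate, in_use)
    if conflicts:
        print("choose another CIDR; overlaps with:", conflicts)
    else:
        print("no overlap; block to create:", describe_dummy_block(candidate))
```

The list of networks in use has to come from your own environment; the check is only as complete as that list.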