"failed to get instance ID from cloud provider" returned after TKGI upgrade to version 1.18.x
Article ID: 366989

Products

VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

  • For environments using the AWS Cloud Provider, upgrading from older versions of TKGI to any 1.18.x version will report the TKGI cluster Status as "failed" for the "UPGRADE" Action
  • The BOSH task associated with the TKGI upgrade will report "1 errored" state, specifically for the "run errand apply-addons"
  • Viewing the task with the --debug flag, the apply-addons errand will show a stderr failure due to "error: deployment \"coredns\" exceeded its progress deadline"
  • From within the cluster context, the kubectl get nodes command will return no nodes, or nodes with names that don't match the "Private IP DNS name" in the AWS instance.
  • From within the cluster master node, viewing the aws-cloud-controller-manager leader logs (from /var/vcap/sys/log/aws-cloud-controller-manager.stderr.log) will return errors like: "failed to get provider ID for node" and "failed to get instance ID from cloud provider: instance not found"
  • The AWS instance matching the reported BOSH VM CID will appear as powered on in AWS.
  • In AWS, the VPC backing the subnet used by instance VMs is configured to use a "DHCP Option Set" that references a custom "Domain Name"

 

Cause

  • This problem occurs due to the migration from the in-tree AWS Cloud Provider to the out-of-tree AWS Cloud Provider during the upgrade of TKGI to version 1.18.x. In the older in-tree Cloud Provider, if the "--cloud-provider" variable is added to the kubelet and kube-proxy configurations, the "--hostname-override" variable is ignored, as detailed in the upstream Kubernetes documentation.
  • The out-of-tree AWS Cloud Provider is architected differently than the in-tree provider and relies on different kubelet and kube-proxy configuration variables to run. In particular, the out-of-tree provider requires the kubelet and kube-proxy config variable "--cloud-provider=aws" be applied if the Cloud Provider is AWS. If this variable is set, and the environment uses custom domain names (configured in the AWS VPC DHCP Option Set), an additional variable must be added to the kubelet and kube-proxy config for "--hostname-override" to match the regional domain name assigned to the AWS Instance VM.
  • These requirements are noted in the prerequisites documentation in AWS, here: cloud-provider-aws
  • If the "--hostname-override" value doesn't match the regional domain name in the AWS Instance VM, the node might not join the cluster, or if if joins the cluster, it might never have the "node.cloudprovider.kuberenetes.io/uninitialized=true:NoSchedule" taint removed, preventing pod scheduling on the updated nodes.

 

 

Resolution

  • Engineering teams at VMware by Broadcom are working to resolve this issue; the fix will be released in a future version of TKGI.
  • The preferred path forward is to wait for the official code fix; however, if this issue is a blocker, use the workaround below for resolution.

Additional Information

WORKAROUND:

At a high level, this workaround uses the BOSH os-conf release in a "runtime config" to apply node-level file modifications via a pre-start script. The node-level change is a modification of the "--hostname-override" variable in the kubelet_ctl and kube_proxy_ctl job scripts, followed by a restart of the kubelet and kube-proxy monit processes. The "--hostname-override" variable is replaced with the reverse-lookup value of the eth0 IP address, which yields the DNS name configured on the AWS Instance. The os-conf runtime configuration presented below is applied to all [worker] instance groups in ALL BOSH deployments not explicitly excluded in the "exclude.deployments" section of the configuration.

 

    1. Upload os-conf-release:

      bosh upload-release --sha1 daf34e35f1ac678ba05db3496c4226064b99b3e4 "https://bosh.io/d/github.com/cloudfoundry/os-conf-release?v=22.2.1"
    2. Confirm upload:

      bosh releases | grep os-conf
      os-conf                       22.2.1                         a2154d6
    3. Create runtime config: 

      NOTE: Please update the exclude.deployments: [] section to exclude any deployments this workaround should NOT be applied to. This section is comma-delimited. The below command will create a file called runtime.yml with the required configuration variables.

      cat <<'EOFA' > runtime.yml
      releases: 
      - name: "os-conf"
        version: "22.2.1" 
      addons:
      - name: aws-dhcp-configuration
        exclude: 
          deployments: [pivotal-container-service-<ID>,harbor-container-registry-<ID>]  
        include:
          instance_groups: [worker]
        jobs:
        - name: pre-start-script
          release: os-conf
          properties:
            script: |-
              #!/bin/bash
              sed -i "s_curl http://169.254.169.254/latest/meta-data/hostname_host \`ip a s eth0 \| grep inet \| awk -F\"inet \" '{print \$2}' \| awk -F\"/\" '{print \$1}'\` \| awk '{print \$5}' \| sed 's=.$=='_" /var/vcap/jobs/kubelet/bin/kubelet_ctl
              cat /var/vcap/jobs/kubelet/bin/kubelet_ctl | grep eth0
              sed -i "s_curl http://169.254.169.254/latest/meta-data/hostname_host \`ip a s eth0 \| grep inet \| awk -F\"inet \" '{print \$2}' \| awk -F\"/\" '{print \$1}'\` \| awk '{print \$5}' \| sed 's=.$=='_" /var/vcap/jobs/kube-proxy/bin/kube_proxy_ctl
              cat /var/vcap/jobs/kube-proxy/bin/kube_proxy_ctl | grep eth0
              monit restart kube-proxy
              monit restart kubelet
              echo "done"
      EOFA
    4. Update bosh runtime config:

      bosh update-runtime-config runtime.yml
    5. Verify the runtime config:

      bosh runtime-config
    6. Upgrade Cluster:

      NOTE: The runtime configuration will only be applied to clusters during an upgrade action.

      tkgi upgrade-cluster <CLUSTER_NAME>