How-to update DNS on vSphere with Tanzu Workload Cluster nodes after adding new DNS servers under Workload Management



Article ID: 319397


Updated On:

Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

This article provides a workaround to push updated DNS server settings to vSphere with Tanzu Workload Clusters effectively.

Symptoms:  

After updating DNS servers for vSphere with Tanzu via the vSphere Client -> Supervisor Cluster -> Configure -> Network -> 'Management Network' and 'Workload Network', existing and newly created pods on Workload Clusters do not pick up the changes.
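
To confirm the symptom, a test lookup can be run from a throwaway pod on the Workload Cluster against a name that only the newly added DNS servers can resolve. This is a minimal sketch; the busybox image and the FQDN are placeholders for whatever is appropriate in your environment:

# kubectl run dns-check --rm -it --image=busybox --restart=Never -- nslookup <fqdn-resolved-by-the-new-dns-servers>

The lookup fails or times out because the cluster is still forwarding queries to the previously configured DNS servers.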

Environment

VMware vSphere 8.0 with Tanzu
VMware vSphere 7.0 with Tanzu

Cause


This behavior occurs on Workload Clusters and not on the Supervisor Cluster because the Supervisor Cluster's CoreDNS pod forwards requests directly to the DNS servers configured under Workload Management and picks up changes immediately. The Workload Cluster's CoreDNS pod, however, forwards requests to the Workload Cluster node's DNS configuration, i.e. the node's resolv.conf, which in turn points to the DNS servers configured in Workload Management. The network configuration file on the Workload Cluster node, /etc/systemd/network/10-gosc-eth0.network, is only regenerated when the nodes are recreated.
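
The forwarding behavior described above can be checked from the Workload Cluster itself. In a typical kubeadm-based Workload Cluster, the CoreDNS configuration lives in the coredns ConfigMap in kube-system, and its forward plugin points at /etc/resolv.conf rather than at explicit DNS server IPs, which is why the node-level file matters. A minimal check, assuming that layout:

# kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}'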

Resolution


There is no resolution for this issue; the behavior is by design.

Workaround:

 

To work around this issue, you can recreate the nodes by changing the vmClass of the Workload Cluster nodes. This triggers a rolling update, and the recreated nodes regenerate the network configuration file at /etc/systemd/network/10-gosc-eth0.network with the new DNS servers.

If changing the vmClass is not feasible (e.g., due to policy constraints or resource compatibility), an alternative approach is to manually trigger the rollout by patching the rolloutAfter field of the KubeadmControlPlane (KCP) and MachineDeployment (MD) resources of the cluster. This achieves the same result of recreating the nodes without modifying the VM class.


Option 1: Change the VM Class
 

  • Use the following command to make the changes:


# kubectl edit tkc -n <namespace> <tkc-name>
 

  • Ensure that the selected vmClass has been added to the Workload Cluster's vSphere Namespace in vCenter under Workload Management (the bound classes can also be listed with kubectl, as shown after the manifests below).
 
  • For TKCs using the v1alpha2 API, make the following changes to the cluster manifest:
     

spec:
  topology:
    controlPlane:
      vmClass: <vm-class-name>
    nodePools:
    - name: <node-pool-name>
      vmClass: <vm-class-name>

 

  • For TKCs using the v1alpha1 API, make the following changes to the cluster manifest:
 

spec:
  topology:
    controlPlane:
      class: <vm-class-name>
    workers:
      class: <vm-class-name>
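
  • Once the manifest change is saved, the rolling update can be observed from the Supervisor Cluster. The commands below are a minimal sketch, assuming the vSphere Namespace placeholder <namespace> and that your Supervisor Cluster exposes the VirtualMachineClassBinding resource; the first command lists the VM classes bound to the namespace, and the second watches the old machines being replaced:

    # kubectl get virtualmachineclassbindings -n <namespace>
    # kubectl get machines -n <namespace> -w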

 

Option 2: Patch the Cluster Resources to Force Rollout

  • Once logged in to the Supervisor Cluster, use kubectl to retrieve the names and details of the relevant resources, primarily the KubeadmControlPlane and MachineDeployment objects for the VKS cluster in question. In this example, the cluster tkgs-cluster in the tkgs-cluster-ns vSphere Namespace has a single worker node pool (represented by one MachineDeployment object) and a single-node control plane (represented by the KubeadmControlPlane object).

    # kubectl get cluster -A | grep tkgs-cluster
    tkgs-cluster-ns   tkgs-cluster   builtin-generic-v3.1.0   Provisioned   141d   v1.29.4+vmware.3-fips.1

    # kubectl get kcp -n tkgs-cluster-ns
    NAME                 CLUSTER        INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE    VERSION
    tkgs-cluster-pnc5q   tkgs-cluster   true          true                   1          1       1         0             141d   v1.29.4+vmware.3-fips.1

    # kubectl get md -n tkgs-cluster-ns
    NAME                                    CLUSTER        REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE    VERSION
    tkgs-cluster-worker-nodepool-a1-k96lk   tkgs-cluster   1          1       1         0             Running   141d   v1.29.4+vmware.3-fips.1

     

  • You can patch the KubeadmControlPlane object using the following command to trigger a rollout of the control plane nodes:
    kubectl patch kcp -n tkgs-cluster-ns tkgs-cluster-pnc5q --type merge -p "{\"spec\":{\"rolloutAfter\":\"`date +'%Y-%m-%dT%TZ'`\"}}"

     

  • This will initiate the rolling update. You can verify the rollout by listing the machines in the cluster:

    # kubectl get ma -n tkgs-cluster-ns
    NAME                                  CLUSTER       NODENAME                            PROVIDERID             PHASE         AGE   VERSION
    tkgs-cluster-pnc5q-w5td9              tkgs-cluster  tkgs-cluster-pnc5q-w5td9            vsphere://<uuid>       Running       16h   v1.29.4+vmware
    tkgs-cluster-pnc5q-xj88z              tkgs-cluster                                      vsphere://<uuid>       Provisioning  12s   v1.29.4+vmware
    tkgs-cluster-wrkpl-a1-fgqwf           tkgs-cluster  tkgs-cluster-wrkpl-a1-fgqwf         vsphere://<uuid>       Running       16h   v1.29.4+vmware
    

     

  • Similarly, you can patch the MachineDeployment (MD) object using the following command to trigger a rollout of the worker nodes:
    kubectl patch md -n tkgs-cluster-ns tkgs-cluster-worker-nodepool-a1-k96lk --type merge -p "{\"spec\":{\"rolloutAfter\":\"`date +'%Y-%m-%dT%TZ'`\"}}"

     

  • This will initiate the rolling update. You can verify the rollout by listing the machines in the cluster:

    # kubectl get ma -n tkgs-cluster-ns
    NAME                               CLUSTER       NODENAME                            PROVIDERID           PHASE          AGE    VERSION
    tkgs-cluster-pnc5q-xj88z           tkgs-cluster  tkgs-cluster-pnc5q-xj88z            vsphere://<uuid>     Running        19m    v1.29.4+vmware
    tkgs-cluster-wrkpl-a1-fgqwf        tkgs-cluster  tkgs-cluster-wrkpl-a1-fgqwf         vsphere://<uuid>     Running        17h    v1.29.4+vmware
    tkgs-cluster-wrkpl-a1-qqkfq        tkgs-cluster                                      vsphere://<uuid>     Provisioning   112s   v1.29.4+vmware

     

  • This approach provides a quick and effective way to manually trigger rolling updates of the control plane and worker nodes without modifying other cluster configuration. Once all machines report a Running phase, as in the output below, the manual update process is complete (a node-level DNS check is shown after the output):
    # kubectl get ma -n tkgs-cluster-ns
    NAME                                  CLUSTER      NODENAME                                PROVIDERID                              PHASE    AGE    VERSION
    tkgs-cluster-pnc5q-xj88z              tkgs-cluster tkgs-cluster-pnc5q-xj88z                vsphere://<uuid>                        Running  24m    v1.29.4+vmware
    tkgs-cluster-wrkpl-a1-qqkfq           tkgs-cluster tkgs-cluster-wrkpl-a1-qqkfq             vsphere://<uuid>                        Running  7m16s  v1.29.4+vmware
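
  • To confirm that the recreated nodes picked up the new DNS servers, the node-level resolver configuration can be checked without SSH by using a node debug pod. This is a minimal sketch, assuming kubectl debug node support in your Workload Cluster version and a reachable busybox image; run it while logged in to the Workload Cluster, and delete the node-debugger pod it creates once you are done:

    # kubectl debug node/<node-name> -it --image=busybox -- cat /host/etc/resolv.conf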
    



 

Additional Information

vSphere with Tanzu 7.0

Impact/Risks:

DNS lookups performed by workloads running in Workload Clusters may fail due to this issue.