Volume Node Affinity Conflict - Incorrect Node Labels During Tanzu Hub Upgrade When VMs Are Recreated



Article ID: 430293


Products

VMware Tanzu Platform - Hub

Issue/Introduction

During an upgrade that involves BOSH VM changes (a stemcell update, or any other change that recreates the BOSH VMs), the previous nodes are drained and the VMs are recreated.

In such cases, the newly created VMs do not retain the existing node labels, and the errand assigns labels based on the new VM IP address. As a result, stateful pods (anything that depends on the node label kubernetes.io/hostname) remain in a Pending state.

For example, the postgres pods and PVs.
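To confirm the symptom, list the pods stuck in Pending and inspect one of them. The tanzusm namespace is the one used later in this article; the pod name below is only an example:

# Pods stuck in Pending (tanzusm is the namespace used in this article)
kubectl get pods -n tanzusm --field-selector=status.phase=Pending

# Scheduling events of an affected pod (pod name is an example)
kubectl describe pod postgresql-0 -n tanzusm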

The PV for postgresql-pg-data-postgresql-0 has a required nodeAffinity rule that points to the node label kubernetes.io/hostname=<IP>:

nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - 10.###.##.##

If the IP address of the postgres VM changes, the new node's kubernetes.io/hostname label carries the new IP, while the PV still references the old one, causing a mismatch. The pod hangs in the Pending state with a scheduling error:

"1 node(s) had volume node affinity conflict".

Environment

Tanzu Hub 10.3.0

Cause

The kubelet manifest does not assign stable hostnames; BOSH sets the hostname to the IP address of the VM. If the IP address changes during VM recreation, the hostname changes as well, and the existing PV can no longer be bound to the node, because the local-path provisioner hardcodes the PV's node affinity to the hostname.
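This can be seen on the VMs themselves: BOSH sets the hostname to the VM's IP address, and kubelet registers the node under that name. A quick check (the instance name below is an example; use any instance from the deployment):

# The VM's hostname is its IP address (instance name is an example)
bosh ssh postgres/0 -c 'hostname'

# Node names and internal IPs match one-to-one
kubectl get nodes -o wide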

Resolution

This will be fixed in a future release of Tanzu Hub.

---

Manual Steps to Fix

  • Set the BOSH_DEPLOYMENT environment variable to the Tanzu Hub deployment name
    export BOSH_DEPLOYMENT=$(bosh deployments | grep -E '^hub-[0-9a-f]{20}' | awk '{print $1}')
  • List all VMs with their IP addresses and attached PVs
    bosh instances --details --json | jq -r '.Tables[0].Rows[] | select(.disk_cids != "") | "\(.instance) \(.ips)"' | while read inst ip; do echo "=== $inst ($ip) ==="; bosh ssh "$inst" -c "ls -l /var/vcap/store" | grep -E 'pvc-|==='; done
  • Log in to the Registry VM
    bosh ssh registry
  • Identify which PVs are bound to nodes that no longer exist (nodes whose IP changed)
    existing=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); kubectl get pv -o json | jq -r --arg nodes "$existing" '.items[] | select(.spec.nodeAffinity != null) | .metadata.name as $pv | .spec.nodeAffinity.required.nodeSelectorTerms[].matchExpressions[] | select(.key == "kubernetes.io/hostname") | .values[] | select(. as $v | ($nodes | split(" ") | index($v)) == null) | "\($pv) -> \(.)"'
  • IMPORTANT: Make sure the reclaim policy of all affected PVs is 'Retain' (check the RECLAIM POLICY column)
    kubectl get pv
  • For each affected PV:
    • Export the PV configuration stripping unnecessary fields
      kubectl get pv pvc-116dde38-9943-49c3-b2b4-4f410e23eba6 -o json | jq 'del(.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid,.metadata.managedFields,.spec.claimRef,.status)' >pv.json
    • Edit the PV configuration, changing the node IP address to the new one from the instance list in the first step (see the jq alternative after these steps):
      vi pv.json
    • Delete the PV and remove its finalizers
      kubectl delete pv pvc-116dde38-9943-49c3-b2b4-4f410e23eba6 --wait=false
      kubectl patch pv pvc-116dde38-9943-49c3-b2b4-4f410e23eba6 -p '{"metadata":{"finalizers":null}}' --type=merge
      kubectl wait --for=delete pv/pvc-116dde38-9943-49c3-b2b4-4f410e23eba6
    • Re-create the PV with the new configuration
      kubectl apply -f pv.json
    • Make sure the corresponding PVC returns to the 'Bound' state
      kubectl get pvc -n tanzusm
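As an alternative to editing pv.json by hand in vi, the pinned hostname can be rewritten with jq. This is a sketch, assuming the PV carries a single kubernetes.io/hostname match expression; NEW_IP is a placeholder for the node's new IP address taken from the instance list in the first step:

# NEW_IP is a placeholder; substitute the new node IP from the instance list
NEW_IP=10.###.##.##
jq --arg ip "$NEW_IP" '(.spec.nodeAffinity.required.nodeSelectorTerms[].matchExpressions[] | select(.key == "kubernetes.io/hostname") | .values) = [$ip]' pv.json > pv.new.json && mv pv.new.json pv.json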
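Once every affected PV has been re-created, a final check is that no pod is left Pending:

# Should return no resources once all PVs are rebound and the pods are scheduled
kubectl get pods -n tanzusm --field-selector=status.phase=Pending

Re-running the stale-PV check from the earlier step should likewise produce no output.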