Worker Nodes are unhealthy after migrating to a new Storage Policy in vSphere with Tanzu

Article ID: 383814

Products

VMware vSphere Kubernetes Service

Issue/Introduction

Supervisor worker nodes (ESXi hosts) are showing as unhealthy after migrating Storage Policies in vSphere with Tanzu. Based on the Spherelet configuration, the nodes are still trying to use both the old and new Storage Policies and Datastores, and the old Storage Policy and Datastore are reported as unhealthy.

Conditions:
  Type           Status  LastHeartbeatTime                 LastTransitionTime                Reason                         Message
  ----           ------  -----------------                 ------------------                ------                         -------
  Ready          True    Fri, 15 Nov 2024 11:25:57 +0000   Fri, 15 Nov 2024 11:25:57 +0000   KubeletReady                   Spherelet is ready.
  DiskPressure   True    Mon, 01 Jan 0001 00:00:00 +0000   Mon, 01 Jan 0001 00:00:00 +0000   Host can't access datastores   failed to find any accessible datastores for storage policy <old-storage-policy-uid> datastore URLs: [ds:///vmfs/volumes/<old-datastore-id>/]
Addresses:
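
To see which worker nodes are affected and to view these conditions from the Supervisor control plane, you can use standard kubectl commands such as the following (the node name is a placeholder):

kubectl get nodes

kubectl describe node <worker-node-name>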

 

Environment

vSphere with Tanzu 8.0 U3d and above

Cause

The old Storage Policy remains in the Supervisor desired configuration.

In vCenter, under Workload Management > Supervisor > Configuration > Storage policy, you can see that the new storage policy is assigned to "Control Plane nodes", but no other policies are shown.

Get a list of the available storage policies:

 dcli +server 'http://localhost/api' com vmware vcenter storage policies list

|--------------------|-------------|------------------------|
|name                |description  |policy                  |
|--------------------|-------------|------------------------|
|old-storage-policy  |             |old-storage-policy-uid  |
|new-storage-policy  |             |new-storage-policy-uid  |
|--------------------|-------------|------------------------|

Checking the storage policy UIDs assigned in the WCP desired configuration for the Supervisor, you find a match for the old-storage-policy UID under EphemeralStoragePolicy and ImageStorage.StoragePolicy:
 
/usr/lib/vmware-wcp/wcp-db-dump.py | grep -i storage
          "ImageStorage": {
            "StoragePolicy": "<old-storage-policy-uid>"
          "MasterStoragePolicy": "<new-storage-policy-uid>",
          "EphemeralStoragePolicy": "<old-storage-policy-uid>",
        "storage_svcacct_pwd": "CENSORED",
        "last_storage_pwd_rotation_timestamp": xxxxxxxxxxx,
 

You will also find that the Spherelet ConfigMap in the kube-system namespace references both the old and new Storage Policies and Datastores.

kubectl get cm -n kube-system spherelet -o yaml | grep datastores:


  datastores: '{"<new-storage-policy-uid>":["<new-datastore-url>"],"<old-storage-policy-uid>":["<old-datastore-url>"]}'
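
If the datastores mapping is hard to read, the JSON value can be pretty-printed. This is a minimal sketch that assumes python3 is available where kubectl is run:

kubectl get cm -n kube-system spherelet -o jsonpath='{.data.datastores}' | python3 -m json.tool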

Resolution

Before you start, check the storage policy UIDs currently assigned in the WCP desired configuration:

    /usr/lib/vmware-wcp/wcp-db-dump.py | grep -i storage
    
          "ImageStorage": {
            "StoragePolicy": "<old-storage-policy-uid>"
          "MasterStoragePolicy": "<new-storage-policy-uid>",
          "EphemeralStoragePolicy": "<old-storage-policy-uid>",
        "storage_svcacct_pwd": "CENSORED",
        "last_storage_pwd_rotation_timestamp": 1733324403,
    

In a shell session on the vCenter appliance, log in to dcli in interactive mode:

dcli +i

1. List clusters and take note of the cluster moid, <domain-cxx>

    com vmware vcenter namespacemanagement clusters list

2. List the storage policies, check the <new-storage-policy>, and take note of its UID, <new-storage-policy-uid>

    storage policies list

3. Get the cluster configuration and check master_storage_policy, ephemeral_storage_policy, and image_storage -> storage_policy

    namespacemanagement clusters get --cluster <domain-cxx>

4. Update ephemeral_storage_policy and image_storage to use the UID of the target MasterStoragePolicy, <new-storage-policy-uid>

    namespacemanagement clusters update --cluster  <domain-cxx> --ephemeral-storage-policy <new-storage-policy-uid>

    namespacemanagement clusters update --cluster  <domain-cxx>   --image-storage-storage-policy <new-storage-policy-uid>

5. Restart the WCP service on vCenter; this will reconfigure/reconcile the cluster (a status check is shown after step 6)

    service-control --stop wcp && service-control --start wcp

    
6. After you make the change, check that the <new-storage-policy-uid> is now set for ImageStorage.StoragePolicy, MasterStoragePolicy, and EphemeralStoragePolicy

    /usr/lib/vmware-wcp/wcp-db-dump.py | grep -i storage
    
          "ImageStorage": {
            "StoragePolicy": "<new-storage-policy-uid>"
          "MasterStoragePolicy": "<new-storage-policy-uid>",
          "EphemeralStoragePolicy": "<new-storage-policy-uid>",
        "storage_svcacct_pwd": "CENSORED",
        "last_storage_pwd_rotation_timestamp": 1733324403,

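After restarting WCP in step 5, you can confirm the service came back up before re-checking the configuration:

    service-control --status wcp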

On the Supervisor:

7. Check whether the old storage class (tkgs-storage-policy in this example) is still present or in use:

kubectl get sc -A

kubectl get cm -n kube-system spherelet -o yaml

8. Check that the Spherelet ConfigMap in the kube-system namespace only references the new Storage Policy and Datastore:

kubectl get cm -n kube-system spherelet -o yaml | grep datastores:


  datastores: '{"<new-storage-policy-uid>":["<new-datastore-url>"]}'
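
Once the Spherelet ConfigMap references only the new Storage Policy, the DiskPressure condition on the affected worker nodes should clear. A quick way to re-check (the node name is a placeholder):

kubectl describe node <worker-node-name> | grep -A 10 'Conditions:'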

 

If the old storage class is gone, it should be safe to remove the old Storage Policy from vCenter.
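
As an additional check before removing the old Storage Policy from vCenter, you can confirm that no PersistentVolumeClaims or PersistentVolumes still reference the old storage class (the storage class name is a placeholder):

kubectl get pvc -A | grep <old-storage-class-name>

kubectl get pv | grep <old-storage-class-name>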