VKS Cluster showing as "Health_Unspecified" in VKS Cluster management UI in VCF Automation 9.x
search cancel

VKS Cluster showing as "Health_Unspecified" in VKS Cluster management UI in VCF Automation 9.x

book

Article ID: 439364

calendar_today

Updated On:

Products

VCF Automation

Issue/Introduction

  • Following a certificate rotation, guest cluster detachment workflows within the vSphere Kubernetes Service (VKS) fail to execute successfully.
  • This failure state manifests in the VMSP prelude namespace, where the cluster-reaper-service controller emits the following diagnostic logs:
    "YYYY-MM-DDTHH:MM:SSz", stdout F {""component"":""reaper-job"",""level"":""error"",""msg"":""PollService: failed, service=data-protection, job=Cluster Job rid=(rid:c:ef1a360d-####-####-####-99837aa3c484:<cluster>-6w5yf:<cluster>-00), uid=(c:01KN7#################), force=true, step=PollServices, error=rpc error: code = Unavailable desc = connection error: desc = \""transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SSZ is after YYYY-MM-DDTHH:MM:SSZ\"""",""request-id"":""ef72c5af-####-####-####-1d4af36de18a"",""time"":""YYYY-MM-DDTHH:MM:SSZ"",""trace-id"":""4b8cecbb-####-####-####-31b74e71d52d""}"
    
    "YYYY-MM-DDTHH:MM:SSZ", "msg":"failed to delete guest cluster resources", "component": "cluster-operations", "clusterName": "###-#######", "supervisorNamespace": "###-#####-#####", "force": false, "operation": "force detach after normal detach failed", "error":"no matches for kind \"ExtensionResourceOwner\" in version \"clusters.vksm.broadcom.com/v1alpha1\""}
    "YYYY-MM-DDTHH:MM:SSZ", "msg": "deleting resources", "component": "cluster-operations", "clusterName": "###-#######", "supervisorNamespace": "###-#####-#####", "force": false, "operation": "force detach after normal detach failed", "resource": "extensionresourceowners"
    

Environment

VCF Automation 9.x

Cause

This issue occurs as the dataprotection-server does not automatically restart following a certificate rotation.

It continues to present the old certificate causing other services to fail to connect.

 

Resolution

  • The dataprotection-server deployment in the prelude namespace of the VMSP cluster needs to be restarted after a certificate is rotated
  • Take SSH to the VCFA primary node and login as vmware-system-user and execute the below command:
    kubectl rollout restart deployment -n prelude dataprotection-server
  • Post restart, login to the VCF Automation UI and navigate to the VKS Cluster Health Tab and it should be showing as Healthy.