Supervisor upgrade stuck with error "System error occurred on Master node with identifier ###################"



Article ID: 414091



Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • The Supervisor upgrade gets stuck after provisioning one Control Plane node, and the UI shows the following error:

    System error occurred on Master node with identifier ###################. Details: Base configuration of node ################### failed as a Kubernetes node. See /var/log/vmware-imc/configure-wcp.stderr on control plane node ################### for more information.

  • Output of kubectl get nodes shows one Control Plane node on the new version while the other three remain on the old version:

    root@############################ [ ~ ]# kubectl get nodes -A
    NAME                           STATUS   ROLES                  AGE     VERSION
    ############################   Ready    control-plane          2d23h   v1.27.5+vmware.wcp.x
    ############################   Ready    control-plane,master   367d    v1.26.8+vmware.wcp.x
    ############################   Ready    control-plane,master   367d    v1.26.8+vmware.wcp.x
    ############################   Ready    control-plane,master   367d    v1.26.8+vmware.wcp.x

  • Log snippet from /var/log/vmware-imc/configure-wcp.stdout shows that the connection to the API server on port 6443 was refused:

YYYY-MM-DDTHH:MM:SS Syncing container images of master VM running pods from https://xx.xx.xx.xx:6443
{"error": "Exception", "message": "Failed to list imgpkg bundles Failed to run command: ['kubectl', 'get', 'ns', '-l', 'appplatform.vmware.com/serviceId', '-o', 'json'] ret=1 out={\n    \"apiVersion\": \"v1\",\n    \"items\": [],\n    \"kind\": \"List\",\n    \"metadata\": {\n        \"resourceVersion\": \"\"\n    }\n}\n err=The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?\n", "backtrace": ["  File \"/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py\", line 232, in main\n    syncer.sync(dry_run=args.dry_run)\n", "  File \"/usr/lib/vmware-wcp/upgrade/imagesync.py\", line 385, in sync\n    imageBundles = self.getDeployedImgpkgBundlesFromK8s()\n", "  File \"/usr/lib/vmware-wcp/upgrade/imagesync.py\", line 115, in getDeployedImgpkgBundlesFromK8s\n    raise Exception('%s %s' % (msg, str(e))) from e\n"]}
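To find this kind of failure without reading the whole log, the JSON error records can be pulled out with a short script. The following is a minimal sketch, not part of the product tooling: it assumes each error record sits on a single line, as in the snippet above, and the function names parse_error_line, is_connection_refused and scan_log are hypothetical names chosen for illustration.

```python
import json
import re

# Path taken from this article; pass a different one if your log lives elsewhere.
DEFAULT_LOG = "/var/log/vmware-imc/configure-wcp.stdout"

# Matches the kubectl message quoted in the log snippet above.
REFUSED = re.compile(r"connection to the server \S+ was refused")


def parse_error_line(line):
    """Return the JSON error record on this line, or None if there is none."""
    line = line.strip()
    if not line.startswith("{"):
        return None  # plain-text progress line, not a JSON record
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # truncated or otherwise invalid JSON
    if isinstance(record, dict) and "error" in record:
        return record
    return None


def is_connection_refused(record):
    """True if the record's message reports a refused API server connection."""
    return REFUSED.search(record.get("message", "")) is not None


def scan_log(path=DEFAULT_LOG):
    """Return every JSON error record found in the log file."""
    records = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            record = parse_error_line(line)
            if record is not None:
                records.append(record)
    return records
```

Calling scan_log() on the affected Control Plane node and filtering the result with is_connection_refused lists only the records that match this symptom. The "backtrace" key of each record holds the Python frames shown in the snippet.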

Environment

vSphere Kubernetes Service (VKS)

Cause

  • The upgrade failed because the certificates had expired. This caused configure-wcp to fail on the fourth Control Plane VM (CPVM) and prevented it from being added as a Kubernetes node.

  • Log snippet from /var/log/pods/kube-system_kube-apiserver-############/kube-apiserver/0.log indicates that the certificates on the node have expired:

YYYY-MM-DDTHH:MM:SS stderr 1 authentication.go:70] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SS is after YYYY-MM-DDTHH:MM:SS, verifying certificate SN=xxxxxxxxxxxxxxxx, SKID=, AKID=xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx failed: x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SS is after YYYY-MM-DDTHH:MM:SS]"
YYYY-MM-DDTHH:MM:SS stderr   1 authentication.go:70] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SS is after YYYY-MM-DDTHH:MM:SS, verifying certificate SN=xxxxxxxxxxxxxxxx, SKID=, AKID=xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx failed: x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SS is after YYYY-MM-DDTHH:MM:SS]"
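The x509 message carries both the current time and the certificate expiry time, so the time elapsed since expiry can be read from the log line itself. The sketch below is illustrative only: it assumes the timestamps in an unredacted log are in RFC 3339 form (for example 2024-05-02T10:00:00Z), and the name expired_for is hypothetical.

```python
import re
from datetime import datetime

# Captures the two timestamps in "current time <now> is after <expiry>".
EXPIRY = re.compile(r"current time (\S+) is after ([^\s,\]]+)")


def _parse(timestamp):
    # fromisoformat() on older Python versions rejects a trailing "Z",
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def expired_for(line):
    """Return how long the certificate had been expired when the line was
    logged, or None if the line has no parsable expiry message."""
    match = EXPIRY.search(line)
    if match is None:
        return None
    try:
        return _parse(match.group(1)) - _parse(match.group(2))
    except (ValueError, TypeError):
        return None  # redacted or unexpected timestamp format
```

A positive result confirms that the certificate had already expired when the request was made. Redacted placeholders such as those shown above return None.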

Resolution

Follow the steps below in sequence to upgrade the Supervisor (SV) version successfully: