After a network outage, the supervisor cluster is stuck in a state of recreating workers, control plane nodes and virtual machines

Article ID: 421457

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • A network or power outage occurred in the environment and caused all components to disconnect from the network.
  • Running kubectl get machines shows that many of the virtual machines are in the Provisioned or Deleting state.
  • Attempts to roll out or recreate the virtual machines and workers fail.
  • In the vCenter UI, you see many tasks related to VKS components, such as deleting or powering off virtual machines.
  • vCenter shows errors such as "The operation is not allowed in the current state."
  • You may see PDL and/or APD events on iSCSI datastores, and some datastores may still be inaccessible because they have entered a PDL condition.
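
To gauge how widespread the second symptom is, the machine phases can be tallied. The following is a minimal sketch, assuming the Machine objects expose the Cluster API `.status.phase` field and that kubectl is already pointed at the supervisor cluster. `count_phases` is a hypothetical helper name used for illustration, not part of any VMware tooling.

```shell
#!/bin/sh
# Hypothetical helper: tally phase names read from stdin (one per line)
# and print them as "Phase=count". Blank lines are ignored.
count_phases() {
  awk 'NF' | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Example usage against a live supervisor cluster
# (assumes .status.phase is populated on each Machine):
#   kubectl get machines -A \
#     -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | count_phases
```

A large and non-shrinking count of Provisioned or Deleting machines over repeated runs is consistent with the condition described in this article.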

Environment

vSphere 8.x with VKS supervisor.

Cause

  • The issue occurs because the outage brings down both the network and the datastores in an uncoordinated manner.
  • This can cause various problems, such as hostd overflow and module errors. The impact depends on the nature of the outage and on whether the network switches, if they also lost power, retained the correct configuration.
  • One manifestation is that the SCSI protocol marks datastores as PDL (Permanent Device Loss). For more information on loss of datastore accessibility, see: https://knowledge.broadcom.com/external/article/318712
  • Note: The two items above are examples only, not an exhaustive list of the issues that can occur in VKS during a widespread network outage. This KB applies only when the conditions described in it are met.
  • Note: Restoring connectivity or recovering desired state for components other than VKS is not in the scope of this KB. For related products such as NSX, refer to their respective KBs by searching with the error message seen in each product.

Resolution

  • Ensure that the network outage has been fully resolved and the underlying infrastructure has been restored to the desired state. This includes restored connectivity and VLANs, no lingering issues left over from the outage, accessible storage, and available vCenter/ESXi components as the baseline.
  • If guest control plane endpoints are missing from the supervisor cluster, refer to: https://knowledge.broadcom.com/external/article/323450/
  • To resolve this issue, restart the pods under these services, for example:
    kubectl rollout restart deploy -n vmware-system-vmop vmware-system-vmop-controller-manager vmware-system-nsx
  • Check whether the pods restarted successfully. If they did not, restart the vCenter services with:
    service-control --stop --all && service-control --start --all
  • Other services or pods may also not be running. Restart them as well.
  • To identify stuck virtual machines, run the first command below, then use the second to delete them so that they are initialized again:
    kubectl get virtualmachine -A
    kubectl delete virtualmachine <vm-name> -n <namespace>
  • Observe the tasks in the vCenter UI. If they do not complete and show errors such as "The operation is not allowed in the current state.", you will most likely need to follow the steps below:
    • Ensure that the cluster's DRS is set to Fully Automated.

    • Check that CPU Over-Commitment is disabled under Cluster Settings.

    • Put the hosts with stuck Deletion/Provisioning tasks into maintenance mode. If for any reason a host cannot enter maintenance mode after the VMs have been moved off it, run the following over SSH:
      esxcli system maintenanceMode set -e true -m noAction
      esxcli system maintenanceMode get

    • Reboot the hosts.

    • Reboot the vCenter and observe the Provisioning/Deletion tasks again.

    • Observe the recreation of the impacted components again.
  • If the above troubleshooting steps do not resolve the issue, open a support request with Broadcom Support.
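
The checks for whether pods restarted successfully, and for other pods that are not running, can be partly scripted. The following is a minimal sketch, assuming the default column layout of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS, ...). `not_running` is a hypothetical helper name, and the 180-second timeout is an arbitrary illustrative value.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print "namespace/name status" for every pod whose STATUS column is
# neither Running nor Completed.
not_running() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example usage against a live supervisor cluster:
#   kubectl rollout status deploy/vmware-system-vmop-controller-manager \
#     -n vmware-system-vmop --timeout=180s
#   kubectl get pods -A --no-headers | not_running
```

This filter looks only at the STATUS column, so a pod that is Running but not Ready (for example 0/1 in the READY column) is not reported and still needs a manual check.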