After a network outage, the Supervisor cluster is stuck recreating workers, control plane nodes, and virtual machines
Article ID: 421457
Products
VMware vSphere Kubernetes Service
Issue/Introduction
A network or power outage occurred in the environment, causing all components to disconnect from the network.
Running kubectl get machines shows many of the virtual machines stuck in the Provisioned and Deleting states.
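The stuck machines can be filtered out of the kubectl get machines output with a short pipeline. The listing below is a fabricated sample standing in for a live `kubectl get machines -A` call; the column layout is an assumption and may differ in your release:

```shell
# Fabricated sample of `kubectl get machines -A` output (column layout is an assumption)
machines='NAMESPACE   NAME       CLUSTER   PHASE
ns1         worker-1   tkc-1     Provisioned
ns1         worker-2   tkc-1     Running
ns1         worker-3   tkc-1     Deleting'
# Print only machines stuck in the Provisioned or Deleting phase
echo "$machines" | awk 'NR>1 && ($4=="Provisioned" || $4=="Deleting") {print $2}'
```

Against a live cluster, pipe the real command output into the same awk filter instead of the sample variable.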
Attempts to roll out or recreate the virtual machines and worker nodes fail.
Within the vCenter UI, you see many tasks related to VKS components, such as deleting a virtual machine or powering off a virtual machine.
You see errors within vCenter such as "The operation is not allowed in the current state."
You may see PDL and/or APD events that occurred on iSCSI datastores, and some datastores may still be inaccessible because they remain in a PDL condition.
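One common way to confirm a PDL condition is to search the ESXi host's vmkernel.log for the permanently-inaccessible message. The exact wording can vary by build, and the excerpt below is fabricated for illustration:

```shell
# Fabricated vmkernel.log excerpt (on a live host the log is /var/log/vmkernel.log)
log_sample='2024-01-01T00:00:01Z cpu1: ScsiDevice: naa.600abc state changed
2024-01-01T00:00:02Z cpu1: WARNING: Device naa.600abc has been removed or is permanently inaccessible'
# Count PDL indications in the sample
echo "$log_sample" | grep -c "permanently inaccessible"
```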
Environment
vSphere 8.x with a VKS Supervisor.
Cause
The issue occurs when a network outage causes both the network and the datastores to go down in an uncontrolled manner.
This can lead to various problems, such as hostd overflow or module errors; the impact depends on the nature of the outage and on whether the network switches retained their correct configuration after a power loss.
Note: The above are only examples and not an exhaustive list of the issues a widespread network outage can cause in VKS. The conditions described in this KB must be met for it to be applicable to the failure condition.
Note: Restoring connectivity or recovering the desired state of components other than VKS is outside the scope of this KB. For related products such as NSX, refer to their respective KBs by searching for the error message seen in each product.
Resolution
Ensure that the network outage has been fully resolved and that the underlying infrastructure has been restored to its desired state. This includes re-established connectivity and VLANs, no lingering issues remaining as artifacts, storage that is accessible again, and vCenter/ESXi components that are available as the baseline.
To resolve the issue, restart the pods under the affected services, for example:
kubectl rollout restart deploy -n vmware-system-vmop vmware-system-vmop-controller-manager vmware-system-nsx
Observe whether the pods restart successfully. If they do not, restart the vCenter services with:
service-control --stop --all && service-control --start --all
You may see other services/pods that are not running; attempt to restart them as well.
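Remaining unhealthy pods can be spotted by filtering for anything whose status is not Running or Completed. The pod listing below is a fabricated sample standing in for a live `kubectl get pods -A` call:

```shell
# Fabricated sample of `kubectl get pods -A` output
pods='NAMESPACE            NAME                 READY   STATUS             RESTARTS
vmware-system-vmop   vmop-ctrl-mgr-abc    1/1     Running            0
vmware-system-nsx    nsx-op-xyz           0/1     CrashLoopBackOff   12'
# Print namespace/name of pods whose STATUS is neither Running nor Completed
echo "$pods" | awk 'NR>1 && $4!="Running" && $4!="Completed" {print $1"/"$2}'
```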
To identify stuck virtual machines, run the first command below, then use the second to delete them and kickstart another initialization:
kubectl get virtualmachine -A
kubectl delete virtualmachine <name> -n <namespace>
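When many VirtualMachine objects are stuck, the delete commands can be generated in a loop. The namespace/name pairs below are hypothetical, and `echo` is kept in front of the kubectl invocation so this sketch prints the commands rather than executing them:

```shell
# Hypothetical list of stuck VMs as "namespace name" pairs
stuck_vms='ns1 tkc-worker-a
ns2 tkc-worker-b'
# Emit one delete command per VM (remove the echo to actually run them)
echo "$stuck_vms" | while read -r ns name; do
  echo "kubectl delete virtualmachine $name -n $ns"
done
```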
Observe the tasks in the vCenter UI. If they do not complete and show errors such as "The operation is not allowed in the current state.", it is very likely that you need to follow the steps below:
Ensure that the cluster's DRS is set to Fully Automated.
Check that CPU overcommitment is disabled under Cluster Settings.
Put the hosts with stuck Deletion/Provisioning tasks into maintenance mode. If for any reason you cannot enter maintenance mode on a host after the VMs have been moved off, run from SSH:
esxcli system maintenanceMode set -e true -m noAction
esxcli system maintenanceMode get
Reboot the hosts, then reboot vCenter, and re-observe the Provisioning/Deletion tasks.
Re-observe the recreation of the impacted components.
If the above troubleshooting steps do not resolve the issue, open a support request with Broadcom Support.