All SEs on Azure cloud got deleted and re-created
- Avi controller with SEs deployed in Azure.
- In Avi, with an Azure cloud, a periodic reconcile runs in which the controller checks whether the SEs in its database match the ones deployed in Azure.
- To achieve this, the controller relies on the Azure API correctly returning the resources deployed.
- The core issue in this case was that the Azure API did not error out, and returned an empty list when the controller queried for SEs.
- This exposed a corner case in how the controller handles empty API responses.
-- An empty response indicates to the controller that the SEs do not exist in Azure. If an SE that was created by the controller no longer exists in the cloud, we need to clean up all references to it.
-- The cleanup is done in a specific order: it starts with deletion of the SEVMruntime and eventually covers the other SE objects.
-- At this point, the controller believed that there were no SEs in the cloud, and it was cleaning up only its own database entries.
-- Once the API started working again, Azure returned the list of deployed SEs. However, those SEs had already been deleted from the controller database. In such a case, we need to clean up the SEs in the cloud, since the SE VMs are not usable by the controllers without the metadata that we store in our database.
-- Because the SEVMruntimes had already been deleted, the controllers then deleted the SEs, which revealed this issue.
- The underlying cause, however, was confirmed to be an issue on the Azure side.
- Azure had a scheduled migration to enable a feature.
- During that migration, a misconfiguration led to a subset of resources not being fully migrated. Consequently, some resource-group-scoped queries did not return the complete list of resources, and in certain instances no resources were returned at all.
- From the Avi side, we have built a protective fix to avoid the immediate deletion of SEs when an empty response is received.
- However, if the cloud keeps returning an empty list, we eventually have to treat it as the source of truth and clean up the controller database.
- We have also added more logging to assist with troubleshooting.
- This fix is available starting with versions v22.1.7 and v30.2.2.
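
The reconcile behavior and the protective fix described above can be illustrated with a minimal sketch. This is not the Avi controller's implementation. The names `reconcile`, `ReconcileState`, and `EMPTY_RESPONSE_THRESHOLD`, and the threshold value of 3, are hypothetical and chosen only for illustration. The sketch assumes the fix works by tolerating a number of consecutive empty responses before trusting the cloud; the actual criterion used by the controller is not stated here.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: number of consecutive empty responses tolerated
# before the cloud is trusted as the source of truth. Not an Avi setting.
EMPTY_RESPONSE_THRESHOLD = 3


@dataclass
class ReconcileState:
    """Tracks consecutive empty cloud responses across reconcile cycles."""
    empty_streak: int = 0
    log: list = field(default_factory=list)


def reconcile(db_ses, cloud_ses, state):
    """Return (db_cleanup, cloud_cleanup) for one reconcile cycle.

    db_ses:    set of SE names held in the controller database
    cloud_ses: set of SE names returned by the cloud API
    """
    if db_ses and not cloud_ses:
        # The database has SEs but the cloud reports none: suspicious.
        state.empty_streak += 1
        state.log.append(f"empty cloud response #{state.empty_streak}")
        if state.empty_streak < EMPTY_RESPONSE_THRESHOLD:
            # Defer cleanup: the empty list may be a transient API fault.
            return set(), set()
    else:
        state.empty_streak = 0

    db_cleanup = db_ses - cloud_ses      # in the DB, missing from the cloud
    cloud_cleanup = cloud_ses - db_ses   # in the cloud, unknown to the DB
    return db_cleanup, cloud_cleanup
```

With this guard, a single empty response followed by a healthy one results in no deletions on either side. Only an empty list that persists for `EMPTY_RESPONSE_THRESHOLD` consecutive cycles causes the database entries to be cleaned up, which matches the principle that the cloud is eventually the source of truth.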