Sandbox creation issues are seen on a foundation that has already been upgraded.
The issue appears to affect specific worker nodes. If a recreate is attempted on an affected worker node, the bosh recreate operation fails during VM creation with an error such as the following:
Error: Timed out pinging VM 'vm-<ID>' with agent '<AGENT-ID>' after 600 seconds
In summary:
Only specific worker nodes on specific clusters are affected, and when a recreate is attempted the BOSH Director sometimes times out with the error above.
TKGi 1.19.x, 1.20.x
Further investigation showed that all affected nodes were in one AZ and one specific cluster.
After further analysis, we exported the list of all unresponsive VMs from BOSH and confirmed that other unresponsive VMs were also on that same cluster.
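The grouping step described above can be sketched in Python. This is a minimal sketch, not the exact procedure used in the investigation. It assumes the VM list was exported with `bosh vms --json`, and that each row carries `instance`, `process_state`, and `az` fields with unresponsive VMs reported as `unresponsive agent`. The sample data, the instance names, and the helper functions `unresponsive_rows` and `group_by` are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like the assumed `bosh vms --json` output.
# Field names and values are assumptions; compare with a real export first.
SAMPLE_EXPORT = """
{
  "Tables": [
    {
      "Rows": [
        {"instance": "master/hypothetical-a", "process_state": "running", "az": "az-1"},
        {"instance": "worker/hypothetical-b", "process_state": "unresponsive agent", "az": "az-2"},
        {"instance": "worker/hypothetical-c", "process_state": "unresponsive agent", "az": "az-2"},
        {"instance": "worker/hypothetical-d", "process_state": "running", "az": "az-3"}
      ]
    }
  ]
}
"""


def unresponsive_rows(export):
    """Return every row whose process state mentions 'unresponsive'."""
    rows = []
    for table in export.get("Tables", []):
        for row in table.get("Rows", []):
            if "unresponsive" in row.get("process_state", "").lower():
                rows.append(row)
    return rows


def group_by(rows, key):
    """Group instance names by the given row field (for example 'az')."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(key, "unknown")].append(row.get("instance", "unknown"))
    return dict(groups)


if __name__ == "__main__":
    export = json.loads(SAMPLE_EXPORT)
    by_az = group_by(unresponsive_rows(export), "az")
    for az, instances in sorted(by_az.items()):
        print(f"{az}: {len(instances)} unresponsive: {', '.join(instances)}")
```

If all unresponsive VMs fall into a single AZ, as in the sample, that points toward an infrastructure problem rather than a per-VM fault. Mapping those VMs to a specific ESXi host still requires the vCenter inventory, which this sketch does not cover.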
The problem was then narrowed down to a single host. Additional checks confirmed that the issue is related to the NSX fabric on that host and may be caused by a faulty process.
Placing the ESXi host in maintenance mode and migrating all VMs off it resolved both problems: the pod network creation failures and the BOSH ping timeout.