After a node recreation is triggered on a TKGi cluster, the recreated node goes missing.
The VM is created and visible in vCenter. It is assigned an IP address, but it is automatically deleted a few minutes later.
The "bosh vms" command does not show the VM.
The "bosh is --ps" command shows the instance, but with no running processes.
The "tkgi clusters" command shows the cluster with a failed status.
The issue can occur during a cluster upgrade, when nodes are recreated. It can also occur when a manual or automatic recreation is triggered through Bosh.
Other symptoms include:
Other nodes in the same cluster cannot ping the newly created VM's IP.
Bosh Director VM cannot ping the newly created VM's IP.
This KB offers general troubleshooting steps to pinpoint the root cause of the issue.
Cause
There may be several reasons why this issue occurs.
The fact that other nodes in the cluster cannot ping the newly created VM indicates that there may be an underlying network issue or something blocking the connections.
Another possible cause is a duplicate IP address.
Resolution
General troubleshooting steps
Make sure a new VM is created in vCenter when the TKGi upgrade or recreate operation is triggered. To do this, log in to the vSphere Client and monitor the tasks, looking for the creation of a new VM. If that is difficult to monitor (e.g. there are too many ongoing tasks in vCenter), you can run:
# bosh tasks
Identify the ongoing creation/recreation task ID.
# bosh task <task-id> --cpi
If a new VM is being created, you should see entries such as "LogicalPorts found for vm 'vm-<>'". Make a note of the VM ID.
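When the CPI log is long, the VM ID can be filtered out of it rather than read by eye. The following is a minimal sketch, assuming the log entries contain the text "LogicalPorts found for vm '<id>'" exactly as quoted above. The function name extract_vm_ids is hypothetical, and the log format may differ between versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads CPI log text on stdin and prints each unique
# VM ID found in a "LogicalPorts found for vm '<id>'" entry.
# Assumes the entry format quoted in this article.
extract_vm_ids() {
  grep -o "LogicalPorts found for vm '[^']*'" \
    | sed "s/.*vm '\([^']*\)'/\1/" \
    | sort -u
}

# Example usage (run where the bosh CLI is available):
#   bosh task <task-id> --cpi | extract_vm_ids
```

The output is deduplicated, since the same VM ID may be logged more than once during a single task.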
In the vSphere Client, locate the newly created VM. You can search for the VM ID identified in the previous step. Wait until it is assigned an IP address.
Once it is assigned an IP address, copy the IP and paste it into the Search Bar in the vSphere Client. Make sure there are no duplicate IPs, that is, that no other VM is using the same IP address.
In the vSphere Client, right-click the VM, select Edit Settings, and check that the Network Adapter is marked as Connected.
Log in to the Bosh Director VM and run:
# netstat -putan | grep <missing-vm-ip>
Check whether there is any ESTABLISHED connection with that IP address. If no connection is established and ping is not working, something may be blocking the connection.
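Because the new VM is deleted after a few minutes, it can help to turn the netstat check into a yes/no answer that is quick to repeat. The following is a minimal sketch, assuming the usual netstat column layout in which addresses appear as <ip>:<port> preceded by whitespace. The function name has_established is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `netstat -putan` output on stdin and reports
# whether any line for the given IP is in the ESTABLISHED state.
has_established() {
  local ip="$1"
  # index() does a literal substring match, so the dots in the IP are not
  # treated as wildcards. The leading space and trailing colon keep
  # 10.0.0.2 from matching 10.0.0.20 or 110.0.0.2.
  if awk -v ip="$ip" \
      'index($0, " " ip ":") && /ESTABLISHED/ { found = 1 } END { exit !found }'; then
    echo "ESTABLISHED connection found for ${ip}"
  else
    echo "No ESTABLISHED connection for ${ip}"
    return 1
  fi
}

# Example usage on the Bosh Director VM:
#   netstat -putan | has_established <missing-vm-ip>
```

The function returns a non-zero status when no established connection is found, so it can be used in a loop or a conditional while the VM is still present.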
Log in to the NSX UI and paste the VM's IP address into the Search Bar. Make sure there are no duplicate IPs, that is, that only one Network Adapter shows up.
In the NSX UI (Manager UI), go to Plan & Troubleshoot > Traceflow and launch two traces:
From any other node in the cluster (vm-id) to the missing node's IP, using the ICMP type.
If something is blocking the connection, the trace should show where the packet is dropped, for example at a distributed firewall rule.
From the missing node (vm-id) to the Bosh Director VM's IP, using the TCP type on port 4222.
A distributed firewall rule blocking this connection would likewise appear in the trace as the point where the packet is dropped.