TKGm workload cluster upgrade fails. The first control plane node is created and added to the cluster. The Machine object and VM for the second control plane node are also created, but that node is not added to the cluster.
The node name in the Node and Machine objects of the first control plane node is set to localhost. The hostname of the VM is also set to localhost.
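The symptom above can be confirmed from the workload cluster context. The following is a minimal sketch, not part of the original procedure: the helper name has_localhost_node is hypothetical, and it assumes the node is registered with the exact name "localhost" as described above.

```shell
# Hypothetical helper: succeeds (and prints the match) if any node is
# registered with the exact name "localhost".
# Input (stdin): output of `kubectl get nodes -o name`, one "node/<name>" per line.
has_localhost_node() {
  grep -Fx 'node/localhost'
}

# Usage, run against the workload cluster context:
#   kubectl get nodes -o name | has_localhost_node && echo "node named localhost found"
```

If the helper prints node/localhost, the first control plane node has registered with the invalid name.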
Any TKGm version
Node names in a cluster must be unique. Because the first control plane node has already registered as "localhost", the second control plane node cannot join the cluster with the same node name.
The hostname and node name should never be set to localhost; this indicates that an invalid OS image is being used.
There may be multiple OS images with the same image version in vCenter, and the wrong one is being picked up.
Retrieve OS image version
kubectl get osimage <OS Image name> -o jsonpath='{.spec.image.ref.version}'
Sample:
kubectl get osimage v1.28.7---vmware.1-tkg.3-50fb7614ebf10b4a98fbb31220ac0fb1 -o jsonpath='{.spec.image.ref.version}'
v1.28.7+vmware.1-tkg.3-50fb7614ebf10b4a98fbb31220ac0fb1
Find the images in vCenter for the corresponding Kubernetes version
govc find /<Datacenter name> -type m | grep <Kubernetes version>
Example:
govc find /<DATACENTER> -type m | grep 1.28
...
/<DATACENTER>/vm/tkg/photon-5-kube-v1-28-7+vmware-1-tkg-3-50fb7614ebf10b4a98fbb31220ac0fb1
...
Check the version of each image. In the output, find the property with "Id": "VERSION" and check its "DefaultValue"
govc vm.info -json <full path to image> | jq
Sample output:
{
"Key": 10,
"ClassId": "",
"InstanceId": "",
"Id": "VERSION",
"Category": "Cluster API Provider (CAPI)",
"Label": "VERSION",
"Type": "string",
"TypeReference": "",
"UserConfigurable": false,
"DefaultValue": "v1.28.7+vmware.1-tkg.3-50fb7614ebf10b4a98fbb31220ac0fb1",
"Value": "",
"Description": ""
}
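When several templates match, checking each one by hand is error-prone. The sketch below is one possible way to list VERSION values shared by more than one template. The helper name duplicate_versions is hypothetical. The commented pipeline assumes the govc JSON uses the capitalized keys shown in the sample output above; other govc versions may use different key casing.

```shell
# Hypothetical helper: print every VERSION value used by more than one
# template, followed by the indented paths of those templates.
# Input (stdin): one "<version><TAB><template path>" pair per line.
duplicate_versions() {
  awk -F'\t' '
    $1 == "" { next }                                  # skip templates with no VERSION value
    { count[$1]++; paths[$1] = paths[$1] "\n  " $2 }   # collect paths per version
    END {
      for (v in count)
        if (count[v] > 1) printf "%s%s\n", v, paths[v]
    }'
}

# Usage (assumes govc and jq are installed and govc is configured for vCenter):
#   govc find /<DATACENTER> -type m | grep 1.28 | while IFS= read -r path; do
#     version=$(govc vm.info -json "$path" |
#       jq -r '.. | objects | select(.Id? == "VERSION") | .DefaultValue')
#     printf '%s\t%s\n' "$version" "$path"
#   done | duplicate_versions
```

No output means each matching template has a distinct VERSION value. Any version that is printed has more than one candidate template to inspect.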
If there are multiple images with the same version in vCenter, remove the invalid one.
Alternatively, remove all images for that Kubernetes version, download a valid image from the Broadcom Support Portal, and upload it to vCenter.
Rerun the cluster upgrade.