1 2021-05-28T02:35:57.685Z 41290dd2-db92-4fed-8121-4931d59785fc NSX 31669 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="ERROR" errorCode="NCP00010"] nsx_ujo.ncp.nsx.manager.node_service Failed to get node vif or TN ID for node b5465126-199e-4e3d-a297-503ffbf88464 in cluster pas-02
$ kubectl get pods
NAME                       READY   STATUS              RESTARTS   AGE
busybox-6f8f48d8d5-kstn4   0/1     ContainerCreating   0          3m55s
$ kubectl describe pod busybox-6f8f48d8d5-kstn4
...
Events:
Type     Reason                  Age  From               Message
----     ------                  ---  ----               -------
Normal   Scheduled               31m  default-scheduler  Successfully assigned default/busybox-6f8f48d8d5-kstn4 to 242e8238-6ab4-45d0-8f12-07678398d388
Warning  FailedCreatePodSandBox  27m  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5f2438d9e80e5932c92e6a5d73998c29dffac3105888b52e84dbeee0831066a0" network for pod "busybox-6f8f48d8d5-kstn4": networkPlugin cni failed to set up pod "busybox-6f8f48d8d5-kstn4_default" network: netplugin failed: "1 2021-07-06T06:20:43.740Z 242e8238-6ab4-45d0-8f12-07678398d388 NSX 28366 - [nsx@6876 comp=\"nsx-container-node\" subcomp=\"nsx_cni\" level=\"INFO\"] __main__ nsx_cni plugin invoked with arguments: ADD\n1 2021-07-06T06:20:43.741Z 242e8238-6ab4-45d0-8f12-07678398d388 NSX 28366 - [nsx@6876 comp=\"nsx-container-node\" subcomp=\"nsx_cni\" level=\"INFO\"] __main__ Reading configuration on standard input\n1 2021-07-06T06:20:43.741Z 242e8238-6ab4-45d0-8f12-07678398d388 NSX 28366 - [nsx@6876 comp=\"nsx-container-node\" subcomp=\"nsx_cni\" level=\"INFO\"] __main__ Configuring networking for container 5f2438d9e80e5932c92e6a5d73998c29dffac3105888b52e84dbeee0831066a0\n1 2021-07-06T06:20:43.741Z 242e8238-6ab4-45d0-8f12-07678398d388 NSX 28366 - [nsx@6876 comp=\"nsx-container-node\" subcomp=\"nsx_cni\" level=\"DEBUG\"] __main__ Network config from input: {u'cniVersion': u'0.3.1', u'runtimeConfig': {u'portMappings': []}, u'name': u'nsx-cni', u'args': {u'cniSocket': u'/var/vcap/sys/run/nsx-node-agent/cni.sock'}, u'capabilities': {u'portMappings': True}, u'type': u'nsx'}\n"
...
In ncp.stdout.log, you see this error:
2021-07-06T07:10:14.116Z 75aa2e96-6515-4103-a56b-8933d24f447a NSX 11076 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="ERROR" errorCode="NCP00010"] nsx_ujo.ncp.nsx.manager.node_service Failed to get node vif or TN ID for node 242e8238-6ab4-45d0-8f12-07678398d388 in cluster pks-0dea86f7-0f1a-43e3-87ed-fd178c5c3636
# identify the active ncp job on the TAS diego_database instances
$ bosh -d cf-5271d4c2c7b6f10846fd ssh diego_database "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" | grep "This instance is the NCP master"
diego_database/eefef677-96c6-4f8a-aef7-32aaabb8b9ad: stdout | This instance is the NCP master
# identify the active ncp job on the masters of a TKGI k8s cluster
$ bosh -d service-instance_722ab6c4-7ea6-4ab4-93c1-c64362d07d73 ssh master "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" | grep "This instance is the NCP master"
master/50e2ce22-acb5-4cd7-bbf2-e814d4ab81f8: stdout | This instance is the NCP master
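Once the active NCP instance is known, you can confirm the error directly in its log; a minimal check, assuming the standard BOSH log layout under /var/vcap/sys/log (adjust the path if your deployment writes NCP logs elsewhere):
# grep the active master's NCP log for the node_service error
$ bosh -d service-instance_722ab6c4-7ea6-4ab4-93c1-c64362d07d73 ssh master/50e2ce22-acb5-4cd7-bbf2-e814d4ab81f8 "sudo grep 'Failed to get node vif or TN ID' /var/vcap/sys/log/ncp/ncp.stdout.log | tail -5"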
In an NSX-T environment, the logical switch port of each BOSH-deployed VM should have a tag with the scope "bosh/id". Its value is the SHA-1 checksum of the BOSH VM GUID, as the example below demonstrates.
$ bosh -d service-instance_0dea86f7-0f1a-43e3-87ed-fd178c5c3636 vms --column "Instance" --column "VM CID"
Deployment 'service-instance_0dea86f7-0f1a-43e3-87ed-fd178c5c3636'
Instance                                     VM CID
master/fecde9b5-d949-4965-be51-1589e93f4bce  vm-2b84fec5-218c-4a1d-b40f-e0050a497368
worker/5fe5876f-d17c-4f00-9020-4482b4c3c3fe  vm-ee81a0fc-de0a-44b5-93bc-0a1316780805
$ curl -s -k -u "$NSX_USER:$NSX_PASSWORD" -H 'content-type: application/json' "https://${NSX_MANAGER}/api/v1/search?query=resource_type:LogicalPort%20AND%20display_name:*vm-ee81a0fc-de0a-44b5-93bc-0a1316780805*" | jq '.results[] | .tags'
[
  {
    "scope": "bosh/id",
    "tag": "93f5c888c9b2643d2412b892d98c6dfc64dfc90b"
  }
]
$ echo -n "5fe5876f-d17c-4f00-9020-4482b4c3c3fe" | shasum
93f5c888c9b2643d2412b892d98c6dfc64dfc90b -
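You can also search by tag instead of by port display name; a minimal sketch, assuming your NSX-T version accepts tags.tag as a search term:
# compute the expected bosh/id tag locally, then look for a port carrying it
$ EXPECTED_TAG=$(echo -n "5fe5876f-d17c-4f00-9020-4482b4c3c3fe" | shasum | awk '{print $1}')
$ curl -s -k -u "$NSX_USER:$NSX_PASSWORD" -H 'content-type: application/json' "https://${NSX_MANAGER}/api/v1/search?query=resource_type:LogicalPort%20AND%20tags.tag:${EXPECTED_TAG}" | jq '.results[] | {display_name, tags}'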
Additionally, the logical switch port of each master in a TKGI k8s cluster should have a tag with the scope "pks/k8smastervm". Its value is the k8s cluster GUID. For example, for the master VM vm-2b84fec5-218c-4a1d-b40f-e0050a497368 listed above:
$ curl -s -k -u "$NSX_USER:$NSX_PASSWORD" -H 'content-type: application/json' "https://${NSX_MANAGER}/api/v1/search?query=resource_type:LogicalPort%20AND%20display_name:*vm-2b84fec5-218c-4a1d-b40f-e0050a497368*" | jq '.results[] | .tags'
[
  {
    "scope": "pks/k8smastervm",
    "tag": "0dea86f7-0f1a-43e3-87ed-fd178c5c3636"
  },
  {
    "scope": "bosh/id",
    "tag": "4e6b18a578cb1335ef1e6843b28b022213225114"
  }
]
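The same tag-based search can enumerate the master ports of a cluster in one call; a sketch under the same tags.tag assumption (keep the resource_type:LogicalPort filter to avoid unrelated matches):
# list every logical port tagged with this cluster's GUID; expect one per master
$ curl -s -k -u "$NSX_USER:$NSX_PASSWORD" -H 'content-type: application/json' "https://${NSX_MANAGER}/api/v1/search?query=resource_type:LogicalPort%20AND%20tags.tag:0dea86f7-0f1a-43e3-87ed-fd178c5c3636" | jq -r '.results[].display_name'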
When the tag "bosh/id" is missing on a logical switch port, we would hit the error as described in Symptom 1 because the NCP job relies on this tag to find the logical switch port of a BOSH deployed VM.
When the tag "pks/k8smastervm" is missing on a TKGI master's logical switch port, the NSX-T load balancer sitting in front of the master loses this master as a backend pool member. As a result, the TKGI kubernetes cluster is not accessible through NSX-T load balancer if all masters lose the tag "pks/k8smastervm".
# in the example below, 10.####.##.### is the IP of the NSX-T load balancer in front of the TKGI k8s masters
$ kubectl get pods
Unable to connect to the server: dial tcp 10.####.##.###:8443: i/o timeout
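You can observe the same failure from the NSX-T side by listing the load balancer pools and their members; a rough sketch against the Manager API (pool display names vary by deployment, so identify the masters' pool by its member IPs):
# each pool should list one member per healthy master; a master whose port
# lost the "pks/k8smastervm" tag disappears from the pool
$ curl -s -k -u "$NSX_USER:$NSX_PASSWORD" -H 'content-type: application/json' "https://${NSX_MANAGER}/api/v1/loadbalancer/pools" | jq '.results[] | {display_name, members: [.members[]?.ip_address]}'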
To verify whether the tags are present on a VM's logical switch port, perform the following steps.
# list each k8s node's name and ExternalIP
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="ExternalIP")].address}{"\n"}{end}'
242e8238-6ab4-45d0-8f12-07678398d388 172.##.###.5
29360604-c8ff-40eb-9371-0f79688f2b11 172.##.###.3
3eda658f-d666-4c60-a7f7-fadf524efc14 172.##.###.4
# get BOSH VM GUID based on k8s node ExternalIP
$ bosh vms --column instance --column ips | grep "172.##.###.5"
worker/5fe5876f-d17c-4f00-9020-4482b4c3c3fe 172.##.###.5
$ bosh -d service-instance_0dea86f7-0f1a-43e3-87ed-fd178c5c3636 vms --column instance --column "VM CID"
Deployment 'service-instance_0dea86f7-0f1a-43e3-87ed-fd178c5c3636'
Instance                                     VM CID
master/fecde9b5-d949-4965-be51-1589e93f4bce  vm-2b84fec5-218c-4a1d-b40f-e0050a497368
worker/5fe5876f-d17c-4f00-9020-4482b4c3c3fe  vm-ee81a0fc-de0a-44b5-93bc-0a1316780805
$ curl -s -k -u "$NSX_USER:$NSX_PASSWORD" -H 'content-type: application/json' "https://${NSX_MANAGER}/api/v1/search?query=resource_type:LogicalPort%20AND%20display_name:*vm-ee81a0fc-de0a-44b5-93bc-0a1316780805*" | jq '.results[] | .tags'
[]
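Rather than checking nodes one at a time, you can sweep every VM in the deployment; a rough loop built from the same commands above (the grep pattern assumes the usual vm-<uuid> CID format):
# print the logical switch port tags of every VM in the deployment;
# an empty list pinpoints a VM whose port lost its tags
$ for CID in $(bosh -d service-instance_0dea86f7-0f1a-43e3-87ed-fd178c5c3636 vms --column "VM CID" | grep -o 'vm-[0-9a-f-]*'); do
    echo "== ${CID}"
    curl -s -k -u "$NSX_USER:$NSX_PASSWORD" -H 'content-type: application/json' "https://${NSX_MANAGER}/api/v1/search?query=resource_type:LogicalPort%20AND%20display_name:*${CID}*" | jq -c '.results[] | .tags'
  done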
To restore the missing tags, recreate the affected VMs with BOSH, either one instance at a time or the whole instance group:
# TAS: recreate a single diego_cell instance
bosh -d cf-5271d4c2c7b6f10846fd recreate diego_cell/9b53bcde-a720-4d72-99c6-701859355514
# TAS: recreate all diego_cell instances
bosh -d cf-5271d4c2c7b6f10846fd recreate diego_cell
# TKGI: recreate a single worker instance
bosh -d service-instance_0dea86f7-0f1a-43e3-87ed-fd178c5c3636 recreate worker/5fe5876f-d17c-4f00-9020-4482b4c3c3fe
# TKGI: recreate every instance in the cluster deployment
bosh -d service-instance_0dea86f7-0f1a-43e3-87ed-fd178c5c3636 recreate
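After the recreate completes, re-run the tag query from the verification steps; note that a recreated VM gets a new VM CID, so list the VMs again first. The recreated VM's logical switch port should now carry the expected tags.
$ bosh -d service-instance_0dea86f7-0f1a-43e3-87ed-fd178c5c3636 vms --column instance --column "VM CID"
# substitute the new VM CID reported above for <new-vm-cid>
$ curl -s -k -u "$NSX_USER:$NSX_PASSWORD" -H 'content-type: application/json' "https://${NSX_MANAGER}/api/v1/search?query=resource_type:LogicalPort%20AND%20display_name:*<new-vm-cid>*" | jq '.results[] | .tags'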