***This KB is only applicable if you are using NSX-T Container Plugin Tile and Tanzu Application Service***
In this KB article we will discuss the mechanics behind the TAS and NSX-T integration and what to check when a network interface is not assigned to the application container.
When you push an application, it fails to stage because the container cannot be created, and you see the following error message in the application logs:
Cell 11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 failed to create container for instance fc5aa3c1-dfa6-435a-b706-0449f7700015: external networker up: exit status 1
This error occurs when Garden is unable to create a container because it could not assign networking to it. The error can happen for a multitude of reasons; below we explore what to check in TAS when you see it in Rep, Garden, or application logs.
As part of the integration between TAS and NSX-T for container networking, several processes are added to the Diego Cell and Diego Database VMs. These processes watch for events via watcher threads or poll the CF component APIs for events related to an application's lifecycle. Their interactions play a crucial role in the relay race that occurs when an application is pushed to a Diego Cell.
On each Diego Cell VM, the processes nsx-node-agent, ovsdb-server, and ovs-vswitchd are installed as part of the NSX-T integration. nsx-node-agent interfaces with the networking of a container during the application lifecycle; for example, when a container is being created or destroyed, nsx-node-agent receives networking details from the NSX Local Control Plane via the host machine's Hyperbus.
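To confirm these processes are present and running on a given Diego Cell, you can check the monit summary on the VM. A minimal sketch, assuming the deployment name and Diego Cell instance used in the examples later in this article (substitute your own values):

bosh -d cf-382ebe75a0100ffa6525 ssh diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 -c "sudo /var/vcap/bosh/bin/monit summary"

Among the listed processes, nsx-node-agent, ovsdb-server, and ovs-vswitchd should all report as running.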
On the Diego Database VMs, the NCP process is installed as part of the NSX-T integration. NCP acts as the conduit between TAS and the NSX-T API. It does this by connecting to BBS as a listener and monitoring for events, such as when a container is being created or modified; by design, BBS sends an LRP message/event to its listeners, and each listener takes action based on the event type. The events NCP acts on are container creation, container update, and container stop.
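A quick way to see that the ncp process is deployed and running on the Diego Database instances is the bosh CLI's per-process view. A sketch, assuming the deployment name used elsewhere in this article:

bosh -d cf-382ebe75a0100ffa6525 instances --ps | grep -E 'diego_database|ncp'

This lists each diego_database instance together with its monit-managed processes, which should include ncp in a running state.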
During container creation/start, an LRP event is emitted by BBS. This event is picked up by a thread running within the ncp process, which sends the LRP create event to NSX-T Manager as an API request. NSX-T Manager then sends the app container's Logical Port assignment, Container Interface assignment, VLAN, and other networking details to the NSX-T Logical Control Plane, which in turn forwards them to nsx-node-agent. Note: NCP is not responsible for the handoff of networking details to the container; NCP only logs whether the request to the NSX-T API was accepted.
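If you need to verify that NCP received the LRP event and that its API request was accepted, the NCP logs on the Diego Database VMs are the place to look. The path below follows the standard BOSH job-log convention (/var/vcap/sys/log/<job>/); the exact file names may differ between tile versions, so treat this as a sketch rather than an exact recipe:

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database -c "sudo grep -i error /var/vcap/sys/log/ncp/*.log | tail -20"

Repeated errors around logical port creation or NSX API calls at the time of the failed push indicate the disruption is on the NCP-to-NSX-T leg of the flow.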
With this in mind, there are several places where the flow of container networking assignment can be disrupted. We will cover common scenarios and steps you can take before reaching out to NSX-T Support.
The components in bold are TAS components which we can check to ensure the NCP and NSX-T communication workflow occurs as expected.
Cody --> CF Push --> NCP (Diego Database VM) --> NSX-T Management Plane --> NSX Central Control Plane --> NSX Logical Control Plane --> Host VM Hyperbus --> Diego Cell Logical Port --> nsx-node-agent --> NSX-CNI --> garden (application container)
Troubleshooting Steps:
In the event that a container fails to stage with external networker up: exit status 1, attempt the following troubleshooting steps.
Scenario 1: Checking communication status between NCP and the BBS API or NSX API
We want to ensure that the communication between the NCP Leader and the NSX-T API is healthy. To do that we will leverage nsxcli, which is installed on the Diego Database VMs. Note: nsxcli must be executed on the NCP Leader.
1. Identify the Diego Database VM hosting the NCP Leader using nsxcli and bosh ssh. After the command executes, identify the NCP Leader via the STDOUT output; the NCP Leader will report This instance is the NCP master.
bosh -d cf-382ebe75a0100ffa6525 ssh diego_database -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" -r

Instance   diego_database/0400c700-d138-4842-8dd2-e450710c4617
Stdout     Mon Nov 14 2022 UTC 19:23:48.312
           This instance is the NCP master
           Current NCP Master id is 3ddf17bd-b43d-4d13-a8ba-f3f90e6bd458
           Current NCP Instance id is 3ddf17bd-b43d-4d13-a8ba-f3f90e6bd458
           Last master update at Mon Nov 14 19:23:43 2022
Stderr     Unauthorized use is strictly prohibited. All access and activity is subject to logging and monitoring.
           Connection to 172.36.1.15 closed.
Exit Code  0
Error      -

Instance   diego_database/1b5944c2-2de5-426e-a72a-aa74ca5f27c6
Stdout     Mon Nov 14 2022 UTC 19:23:48.170
           This instance is not the NCP master
           Current NCP Master id is 3ddf17bd-b43d-4d13-a8ba-f3f90e6bd458
           Current NCP Instance id is 455aeaf4-a59d-4411-8d3f-4ba2e3598d8b
           Last master update at Mon Nov 14 19:23:47 2022
Stderr     Unauthorized use is strictly prohibited. All access and activity is subject to logging and monitoring.
           Connection to 172.36.1.30 closed.
2. nsxcli has built-in commands to check whether the BBS API and NSX Manager API are reachable by NCP. Execute nsxcli on the Diego Database VM we identified earlier.
bosh -d cf-382ebe75a0100ffa6525 ssh diego_database/0400c700-d138-4842-8dd2-e450710c4617 -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-bbs status; sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-nsx status" -r

Using environment '172.36.0.11' as client 'ops_manager'
Using deployment 'cf-382ebe75a0100ffa6525'

Task 141. Done

Instance   diego_database/0400c700-d138-4842-8dd2-e450710c4617
Stdout     Mon Nov 14 2022 UTC 19:29:50.065
           BBS Server status: Healthy
           Mon Nov 14 2022 UTC 19:29:51.184
           NSX Manager status: ams2-nsxmgr-01.slot-21.pez.vmware.com: Healthy
3. If the BBS Server status is Unhealthy, or any status other than Healthy, you will need to investigate the BBS job hosted on each of the Diego Database VMs; ensure that BBS is in a running state on each Diego Database VM when checking with monit.
Suggestion: Restart BBS and NCP Job on each Diego Database VM
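A minimal sketch of that restart, assuming the monit job names are bbs and ncp (verify with monit summary first, as job names can differ between tile versions); alternatively, bosh -d cf-382ebe75a0100ffa6525 restart diego_database restarts every job on the Diego Database instances:

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database -c "sudo /var/vcap/bosh/bin/monit restart bbs; sudo /var/vcap/bosh/bin/monit restart ncp"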
4. If the NSX Manager status is Unhealthy, or any status other than Healthy, you will need to investigate why the Diego Database VM running the NCP Leader is unable to reach the NSX Manager nodes. Tip: Navigate to NSX UI --> System --> Appliances to see the status of each NSX Manager node.
Suggestion: Power Cycle NSX Manager Nodes via vCenter Reboot
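Before power cycling, it can be worth confirming basic reachability from the Diego Database VM running the NCP Leader to the NSX Manager. A sketch using curl (present on the stemcell) against the NSX Manager FQDN shown in the earlier output; substitute your own NSX Manager address. An HTTP 401/403 response still proves network reachability, whereas a timeout or connection refused points to a network or NSX Manager problem:

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database/0400c700-d138-4842-8dd2-e450710c4617 -c "curl -sk -o /dev/null -w '%{http_code}\n' https://ams2-nsxmgr-01.slot-21.pez.vmware.com"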
Scenario 2: Ensure nsx-node-agent can communicate via Host Machine Hyperbus
In the application logs from staging an app you will see a reference to the Diego Cell ID that is attempting to host the application container. We must check whether the Diego Cell VM's host is able to reach the NSX LCP via Hyperbus, as the NSX LCP is responsible for handing off the container's networking details to the container runtime.
1. Identify the Diego Cell the application failed to stage on.
Downloaded nodejs_buildpack
Downloaded go_buildpack
Cell 11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 creating container for instance fc5aa3c1-dfa6-435a-b706-0449f7700015
Cell 11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 failed to create container for instance fc5aa3c1-dfa6-435a-b706-0449f7700015: external networker up: exit status 1
2. Identify the VM CID (the vCenter name of the VM) using the Diego Cell ID; the second column of the command's output is the VM CID.
bosh -d cf-382ebe75a0100ffa6525 vms --column Instance --column "VM CID" | grep '11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4'

diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4    vm-31107db4-9fb2-4ed5-8072-c5dde6e82b74
3. In the NSX UI, use Search to find the Logical Port of the Diego Cell VM using the VM CID, then click on the hyperlink.
4. Ensure that the "Admin Status" is Up and a tag with SCOPE bosh/id is attached.
5. If the status is not Up or the tag is missing, recreate the Diego Cell VM using the bosh CLI:
bosh -d cf-382ebe75a0100ffa6525 recreate diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4
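As an additional check on this leg of the flow, the NCP CLI includes a node-agent Hyperbus status command. A sketch, assuming the nsx-node-agent job on the Diego Cell ships an nsxcli wrapper in the same style as the ncp job on the Diego Database VM (the path and command may differ in your tile version):

bosh -d cf-382ebe75a0100ffa6525 ssh diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 -c "sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli -c get node-agent-hyperbus status"

A healthy status indicates the Diego Cell can reach the NSX LCP via Hyperbus; anything else points to a host-level Hyperbus issue.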
If the above steps do not resolve applications failing to push due to external networker up: exit status 1, then you should open a case with NSX-T Support. Since the expected communication between NCP and TAS is occurring, there could be an issue with the number of IP addresses assigned in the IP Block, or another issue with NSX-T that is outside the scope of TAS support.