NCP fails to stage container with external networker up error

Article ID: 298154


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

 ***This KB is only applicable if you are using NSX-T Container Plugin Tile and Tanzu Application Service***


In this KB article we discuss the mechanics behind the TAS and NSX-T integration when a network interface is not assigned to an application container.

When you push an application, it fails to stage due to an error while creating the container, and you see the following error message in the application logs:

Cell 11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 failed to create container for instance fc5aa3c1-dfa6-435a-b706-0449f7700015: external networker up: exit status 1

This error occurs when Garden is unable to create a container because networking could not be assigned to it. The error can occur for a multitude of reasons; this article explores what to check in TAS when you see it in the REP, Garden, or application logs.
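The underlying failure is usually logged on the Diego Cell itself. A quick check is shown below (a minimal sketch, assuming the default BOSH log path for the garden job; substitute your own deployment name and Diego Cell GUID, and note the exact log file that captures the failure can vary):

bosh -d cf-382ebe75a0100ffa6525 ssh diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 -c "sudo grep -i 'external networker' /var/vcap/sys/log/garden/garden.stdout.log | tail -5" -r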

As part of the integration between TAS and NSX-T for container networking, several processes are added to the Diego Cell and Diego Database VMs. These processes watch for events via watcher threads or poll the CF component APIs for events related to an application's lifecycle. These interactions play a crucial role in the relay race that occurs when an application is pushed to a Diego Cell.

On each of the Diego Cell VMs, the processes nsx-node-agent, ovsdb-server, and ovs-vswitchd are installed as part of the NSX-T integration. nsx-node-agent interfaces with a container's networking during its application lifecycle; for example, when a container is being created or destroyed, nsx-node-agent receives networking-related details from the NSX Local Control Plane via the host machine's Hyperbus.
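A quick way to confirm these processes are present and running on a Diego Cell is monit (the deployment name and instance GUID below are examples; substitute your own). All three processes should report running:

bosh -d cf-382ebe75a0100ffa6525 ssh diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 -c "sudo /var/vcap/bosh/bin/monit summary" -r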

On the Diego Database VMs, the NCP process is installed as part of the NSX-T integration. The role of NCP is to act as the conduit between TAS and the NSX-T API. NCP achieves this by connecting to BBS as a listener and monitoring for events, such as when a container is being created or modified; by design, BBS sends an LRP message/event to its listeners, and each listener takes action based on the type of LRP message/event. The events that NCP acts on are container creation, container update, and container stop.
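Similarly, you can confirm that the ncp and bbs jobs are running on each Diego Database VM. The monit job names below are typical but may vary slightly between tile versions; run sudo monit summary on its own to see the exact names:

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database -c "sudo /var/vcap/bosh/bin/monit summary | grep -Ei 'ncp|bbs'" -r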
 

During container creation/start, an LRP event is emitted from BBS. This event is picked up by a thread running within the ncp process, which sends the LRP create event to NSX-T Manager as an API request. This triggers NSX-T Manager to send the app container's Logical Port assignment, Container Interface assignment, VLAN, and other networking details to the NSX-T Logical Control Plane, which in turn forwards the details to nsx-node-agent. Note: NCP is not responsible for the handoff of networking details to the container; NCP logs whether the request to the NSX-T API was accepted.
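To confirm that NCP saw and acted on the LRP event for a particular container, one approach (a sketch, assuming the default log path for the ncp job; the path and GUIDs below are examples from this article and may differ in your deployment) is to grep the NCP log on the NCP master for the failing instance GUID:

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database/0400c700-d138-4842-8dd2-e450710c4617 -c "sudo grep 'fc5aa3c1-dfa6-435a-b706-0449f7700015' /var/vcap/sys/log/ncp/ncp.stdout.log | tail -20" -r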

With this in mind, there are several places where the flow of container networking assignment can be disrupted. We will cover common scenarios and steps you can take before reaching out to NSX-T Support.

The workflow below shows the components involved; the TAS components are the ones we can check to ensure the NCP and NSX-T communication workflow occurs as expected.

Cody --> CF Push --> NCP (Diego Database VM) --> NSX-T Management Plane --> NSX Central Control Plane --> NSX Logical Control Plane --> Host VM Hyperbus --> Diego Cell Logical Port --> nsx-node-agent --> NSX-CNI --> garden (application container)


Environment

Product Version: 3.0

Resolution

Troubleshooting Steps:

If the container fails to stage with external networker up: exit status 1, attempt the following troubleshooting steps.

Scenario 1: Check the communication status between NCP and the NSX and BBS APIs

We want to ensure that the communication between the NCP Leader and the NSX-T API is healthy. To do that we will leverage nsxcli, which is installed on the Diego Database VM. Note: nsxcli needs to be executed on the NCP Leader.

1. Identify the Diego Database VM hosting the NCP Leader using nsxcli and bosh ssh. After the command executes, we can identify the NCP Leader from the STDOUT output; the NCP Leader will report This instance is the NCP master.

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" -r

Instance   diego_database/0400c700-d138-4842-8dd2-e450710c4617
Stdout     Mon Nov 14 2022 UTC 19:23:48.312
           This instance is the NCP master
           Current NCP Master id is 3ddf17bd-b43d-4d13-a8ba-f3f90e6bd458
           Current NCP Instance id is 3ddf17bd-b43d-4d13-a8ba-f3f90e6bd458
           Last master update at Mon Nov 14 19:23:43 2022
Stderr     Unauthorized use is strictly prohibited. All access and activity
           is subject to logging and monitoring.
           Connection to 172.36.1.15 closed.

Exit Code  0
Error      -

Instance   diego_database/1b5944c2-2de5-426e-a72a-aa74ca5f27c6
Stdout     Mon Nov 14 2022 UTC 19:23:48.170
           This instance is not the NCP master
           Current NCP Master id is 3ddf17bd-b43d-4d13-a8ba-f3f90e6bd458
           Current NCP Instance id is 455aeaf4-a59d-4411-8d3f-4ba2e3598d8b
           Last master update at Mon Nov 14 19:23:47 2022


Stderr     Unauthorized use is strictly prohibited. All access and activity
           is subject to logging and monitoring.
           Connection to 172.36.1.30 closed.

2. nsxcli has built-in commands to check whether the BBS API and the NSX Manager API are reachable by NCP. We will execute nsxcli on the Diego Database VM we identified earlier.

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database/0400c700-d138-4842-8dd2-e450710c4617 -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-bbs status; sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-nsx status" -r
Using environment '172.36.0.11' as client 'ops_manager'

Using deployment 'cf-382ebe75a0100ffa6525'

Task 141. Done

Instance   diego_database/0400c700-d138-4842-8dd2-e450710c4617
Stdout     Mon Nov 14 2022 UTC 19:29:50.065
           BBS Server status: Healthy

           Mon Nov 14 2022 UTC 19:29:51.184
               NSX Manager status:
                   ams2-nsxmgr-01.slot-21.pez.vmware.com: Healthy

3. If the BBS Server status is anything other than Healthy, you will need to investigate the BBS job hosted on each of the Diego Database VMs; using monit, ensure that BBS is in a running state on each Diego Database VM.
Suggestion: Restart the BBS and NCP jobs on each Diego Database VM, as shown below.
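One way to restart them is with monit over bosh ssh (the monit job names below are typical; run sudo monit summary first to confirm the exact names in your deployment):

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database -c "sudo /var/vcap/bosh/bin/monit restart bbs && sudo /var/vcap/bosh/bin/monit restart ncp" -r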
 

4. If the NSX Manager status is anything other than Healthy, you will need to investigate why the Diego Database VM hosting the NCP Leader is unable to reach the NSX Manager nodes. Tip: Navigate to NSX UI --> System --> Appliances to see the status of each NSX Manager node.
Suggestion: Power cycle the NSX Manager nodes via a vCenter reboot. A basic reachability check from the Diego Database VM is shown below.
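As a basic check, confirm that the NSX Manager API endpoint answers on HTTPS from the Diego Database VM hosting the NCP Leader (the FQDN below is taken from the earlier nsxcli output; substitute your own, and note that an HTTP response code only proves network reachability, not API health):

bosh -d cf-382ebe75a0100ffa6525 ssh diego_database/0400c700-d138-4842-8dd2-e450710c4617 -c "curl -k -s -o /dev/null -w '%{http_code}\n' https://ams2-nsxmgr-01.slot-21.pez.vmware.com/" -r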


Scenario 2: Ensure nsx-node-agent can communicate via Host Machine Hyperbus

In the application logs produced while staging an app, you will see a reference to the Diego Cell ID that is attempting to host the application container. We must check whether the Diego Cell VM's host is able to reach the NSX LCP via Hyperbus, as the NSX LCP is responsible for handing off the container's networking details to the container runtime.
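Before walking through the steps below, you can also check whether nsx-node-agent itself reports a healthy Hyperbus connection using nsxcli on the Diego Cell. This is a sketch only: the nsxcli path under the nsx-node-agent job is an assumption and may differ in your deployment, and the instance GUID is the example from this article.

bosh -d cf-382ebe75a0100ffa6525 ssh diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 -c "sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli -c get node-agent-hyperbus status" -r

A healthy agent reports HyperBus status: Healthy.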

1.  Identify the Diego Cell the application failed to stage on.

 Downloaded nodejs_buildpack
   Downloaded go_buildpack
   Cell 11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 creating container for instance fc5aa3c1-dfa6-435a-b706-0449f7700015
   Cell 11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 failed to create container for instance fc5aa3c1-dfa6-435a-b706-0449f7700015: external networker up: exit status 1

2. Identify the VM CID (the vCenter name of the VM) using the Diego Cell ID; the second column of the command's output is the VM CID.

bosh -d cf-382ebe75a0100ffa6525 vms --column Instance --column "VM CID" | grep '11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4'
diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4 vm-31107db4-9fb2-4ed5-8072-c5dde6e82b74

3. In the NSX UI, use Search to find the Logical Port of the Diego Cell VM using the VM CID, then click the resulting hyperlink.

4. Ensure that the "Admin Status" is Up and a tag with SCOPE bosh/id is attached.
5. If the Admin Status is not Up or the tag is missing, recreate the Diego Cell VM using the bosh CLI:

bosh -d cf-382ebe75a0100ffa6525 recreate diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4


If the above steps do not resolve applications failing to push due to external networker up: exit status 1, open a case with NSX-T Support: the communication between NCP and TAS is occurring as expected, so there could be an issue with the number of IP addresses available in the IP Block, or another issue within NSX-T that is outside the scope of TAS Support.
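If you do open a case, it can help to collect the relevant job logs up front. One way to do so is with bosh logs (substitute your own deployment name and the affected instances):

bosh -d cf-382ebe75a0100ffa6525 logs diego_database
bosh -d cf-382ebe75a0100ffa6525 logs diego_cell/11e52c49-2f54-4fc7-ae25-fc4e6fbf58b4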