- The NCP tile is configured to use the NSX Policy API.
- Many application instances within the same org are starting or stopping concurrently. This can happen during Apply Changes on the foundation.
- App deployments take much longer than usual to become healthy, or fail with the error:
external networker up: exit status 1
TAS with NSX-T policy networking
NCP v4.1 & v4.2
When creating or restarting application containers, NCP needs to provision new ports on the ORG's segment and IP addresses from the IP pool assigned to that segment. If the IP pool is exhausted or there are no available ports on the segment, NCP triggers the creation of a new segment for the ORG and assigns it an IP pool from the configured IP block.
In NSX Policy, the transaction that allocates IPs is asynchronous with respect to port creation. When multiple ports are created at the same time, an IP is allocated even if the corresponding segment port creation has not completed.
As a result, the number of IPs allocated in the pool can be higher than the number of ports on the segment, for example 245 ports but 253 IPs allocated.
The following errors can be found in the NCP logs:
Failed to allocate ip for segment port port_XYZ due to IpPool exhaustion on segment seg_XYZ_0
SegmentPort ID port_XYZ is in ERROR state: Failed to get a valid IP from IpPool /infra/ip-pools/ipp_XYZ_0 with cidr {1}.
NCP normally handles this by marking the segment as "temporarily exhausted" and using or creating another segment, then assigning it an IP pool from the configured IP block.
Due to a software defect in the NCP tile for TAS, this mechanism does not work, and the only way for NCP to know whether a segment can be selected is the count of ports on the segment. Because that count can be lower than the number of IPs already allocated, NCP can keep selecting a segment whose IP pool is exhausted.
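The mismatch described above can be illustrated with a small toy model. This is only a sketch, not NCP code: the `Segment` class and both helper functions are hypothetical, and the pool size of 253 is an assumption chosen to match the 245 ports / 253 IPs example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical model of a segment and its IP pool (not an NCP type)."""
    name: str
    pool_size: int          # assumed number of usable IPs in the pool
    ports: int = 0          # segment ports whose creation has completed
    ips_allocated: int = 0  # IPs handed out, including in-flight ports


def has_room_by_port_count(seg: Segment) -> bool:
    # The only check available when the "temporarily exhausted" marker
    # is not working: compare completed ports against the pool size.
    return seg.ports < seg.pool_size


def has_room_by_ip_pool(seg: Segment) -> bool:
    # What actually determines whether a new port can get an IP.
    return seg.ips_allocated < seg.pool_size


# Numbers from the example above: 245 ports, 253 IPs allocated.
seg = Segment("seg_XYZ_0", pool_size=253, ports=245, ips_allocated=253)
print(has_room_by_port_count(seg))  # segment still looks selectable
print(has_room_by_ip_pool(seg))     # but the pool has no free IP
```

In this model the port-count check keeps reporting free capacity while every new port on the segment would fail to obtain an IP, which matches the errors seen in the NCP logs.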
This issue is resolved in NSX Container Plugin (NCP) versions 4.2.2.0 and 4.1.2.3.
As a workaround, users can consider:
1 - Lowering the max_in_flight value for diego cells during upgrade; the default value should be 4%. Please note that this is the max_in_flight value for the diego_cell instance_group, not the max_in_flight value for containers.
In this way, fewer application instances are recreated concurrently due to cell evacuation. This significantly reduces the possibility of hitting this issue.
2 - If the issue is already occurring, users can manually pace the redeployment of applications. Ideally, stop all the impacted applications and restage them one by one. This significantly reduces the concurrency of segment port creation, and therefore the possibility of hitting this issue.
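The second workaround can be scripted. The following is a minimal sketch, assuming the cf CLI is installed, already logged in, and targeted at the affected org and space. The function name, the app names and the 60-second pause are illustrative assumptions, not recommended values.

```python
import subprocess
import time


def restage_one_by_one(apps, pause_seconds=60, run=subprocess.run):
    """Stop all impacted apps, then restage them sequentially.

    `apps` is a list of app names in the currently targeted space.
    `pause_seconds` is an arbitrary gap between restages (assumption).
    `run` is injectable so the sequence can be dry-run or tested.
    """
    # Stop everything first so no app is recreating ports concurrently.
    for app in apps:
        run(["cf", "stop", app], check=True)

    # Restage one app at a time to keep segment port creation serialized.
    for index, app in enumerate(apps):
        run(["cf", "restage", app], check=True)
        if index < len(apps) - 1:
            time.sleep(pause_seconds)


if __name__ == "__main__":
    # Hypothetical app names; replace with the impacted applications.
    restage_one_by_one(["app-one", "app-two"])
```

Passing a different `run` callable (for example, one that only prints the command) allows the sequence to be reviewed before anything is executed against the foundation.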