Credhub Interpolate times out when pushing an APP in NSX-T environment

Article ID: 298051

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

The following error is observed during App Staging or App Starting:
ERR Unable to interpolate credhub refs: Unable to interpolate credhub references: Post https://credhub.service.cf.internal:8844/api/v1/interpolate: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)


Environment

Product Version: 2.8

Resolution

Here are some possible causes and solutions for this error:
  • A firewall rule or network issue is dropping traffic between the Diego cell and the Credhub VM instances. This requires environment-specific troubleshooting. As a first test, run the following command from the Diego cell that launched the app container:
nc credhub.service.cf.internal 8844 -zv
  • A Segment Gateway is reused by multiple segments in NSX-T. See the VMware KB article https://kb.vmware.com/s/article/76735 for more information. If this is the identified issue, follow the workaround and solution described in that KB article.
  • In environments with several thousand app instances, it can take a long time for the TEP interfaces on the VNI to become active within ESX. When an app is staged or started, the Diego app lifecycle process tries to interpolate any VCAP_SERVICES credentials with Credhub and times out this operation after 45 seconds. In some cases the TEP interface takes more than 45 seconds to become active, which results in this timeout error. Another symptom, visible in a network capture, is ARP requests for the container instance being sent without any reply from the Diego cell.
    • Currently there is no workaround for this issue. In some cases multiple cf push or cf start attempts will eventually be successful. This is a scaling issue with NSX-T that is addressed in NSX-T version 2.5.1. Upgrading to 2.5.1 will resolve the slow TEP activation issues. 
    • In addition, the Credhub team is working on an enhancement that allows operators to adjust the request timeout value for the Credhub CLI. You can track the progress of this change at https://github.com/cloudfoundry-incubator/credhub-cli/issues/82. Once the CLI is enhanced, the Diego components will be able to consume the change and expose it as a configurable timeout value. When the CLI changes are complete, this could serve as a workaround for environments that have not yet upgraded NSX-T to 2.5.1.
  • A similar scenario, also fixed in NSX-T 2.5.1, resembles the delay in activating the TEP interface, but in this case a workaround is available.
    • In this scenario, the Credhub VM receives the TCP SYN from the container and responds with a SYN-ACK that never reaches the container. The container then retransmits the SYN packet to the Credhub VM. However, when Credhub responds with a new SYN-ACK packet, it changes the TCP sequence number because of a DDoS prevention feature in the Ubuntu OS.
    • When the NSX distributed firewall (DFW) is enabled, every host in an NSX cluster maintains a connection table that tracks the connections to and from every VM's vNIC. When the NSX distributed firewall intercepts a packet, it creates an entry for that packet in the DFW connection state table.
    • Because the Ubuntu OS changed the sequence number, the connection state table on the host intercepts the return traffic, finds that the sequence number of the SYN-ACK packet is different from the sequence number of the SYN packet, and drops it. This is the expected behavior of the DFW and other stateful firewalls.
    • To work around this, add the Credhub VMs to the NSX distributed firewall exclusion list.
    • Upgrade to NSX-T 2.5.1 to resolve this issue.
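
The connectivity test and the 45-second interpolation timeout described above can be combined into a small timing check. The following is a minimal sketch, assuming bash and the coreutils timeout command are available on the machine where it runs. The function names probe_port and time_until_reachable are hypothetical helpers and are not part of any Tanzu or NSX-T tooling. The sketch uses bash's /dev/tcp redirection instead of nc and reports how many seconds pass before the port accepts a TCP connection.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and default values are illustrative assumptions.

# probe_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened within the timeout,
# non-zero otherwise. Uses bash's /dev/tcp redirection, so nc is not needed.
probe_port() {
  local host=$1 port=$2 probe_timeout=${3:-3}
  timeout "$probe_timeout" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# time_until_reachable HOST PORT [DEADLINE_SECONDS]
# Probes about once per second. Prints the number of seconds that elapsed
# before the port answered and returns 0, or returns 1 without output if
# the deadline passes first. The default of 45 mirrors the interpolation
# timeout mentioned in this article.
time_until_reachable() {
  local host=$1 port=$2 deadline=${3:-45}
  local start now
  start=$(date +%s)
  while :; do
    if probe_port "$host" "$port" 1; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    now=$(date +%s)
    if [ $(( now - start )) -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}
```

For example, after sourcing these functions, time_until_reachable credhub.service.cf.internal 8844 60 prints the elapsed seconds if the Credhub port becomes reachable within 60 seconds. A result above 45 would be consistent with the slow TEP activation scenario, although it does not by itself prove that cause. A check run from the Diego cell exercises the cell's own network path, which may differ from the path of a freshly created container.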