CredHub interpolate times out when pushing an app in an NSX-T environment



Article ID: 298051



Products

VMware Tanzu Application Service for VMs
VMware NSX for vSphere
VMware NSX Data Center for vSphere

Issue/Introduction

The following error is observed during App Staging or App Starting:

ERR Unable to interpolate credhub refs: Unable to interpolate credhub references: Post https://credhub.service.cf.internal:8844/api/v1/interpolate: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)



Resolution

Here are some possible causes and solutions for this error:

  • A firewall rule or network issue is dropping traffic between the Diego cell and the CredHub VM instances. This requires environment-specific troubleshooting. For example, you can use the Traceflow function in NSX-T to test connectivity from an app container to the CredHub VM IP address on port 8844. Another test is to run the following command both from the Diego cell that launched the app container and from the container itself:
    nc -zv credhub.service.cf.internal 8844
  • A Segment Gateway is reused by multiple segments in NSX-T. This KB article https://broadcomcms-software-agent.wolkenservicedesk.com/wolken/esd/knowledge-base-view/view-kb-article?articleNumber=327296 describes this scenario and provides a workaround and a resolution.

  • In environments with several thousand app instances, it can take a long time for the TEP interfaces on the VNI to become active within ESX. When an app is being staged or started, the Diego app lifecycle process tries to interpolate any VCAP_SERVICES credentials with CredHub and times out this operation after 45 seconds. In some cases it takes more than 45 seconds for the TEP interface to become active, which results in this timeout error. Another symptom, visible in a network capture, is ARP requests for the container instance being sent without any reply from the Diego cell.

    • Currently there is no workaround for this issue. In some cases, repeated cf push or cf start attempts eventually succeed. This is a scaling issue with NSX-T that was addressed in NSX-T version 2.5.1.
    • In addition, in February 2020 the CredHub team added an enhancement that allows operators to adjust the request timeout value for the CredHub CLI (https://github.com/cloudfoundry-incubator/credhub-cli/issues/82).
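
The connectivity check and the 45-second interpolate timeout described above can be combined into a small probe. The following is a minimal sketch, not a supported tool. It assumes nc is available where it runs (the Diego cell or the app container) and that its -z and -w options behave as in common netcat builds. The function names probe_once and wait_for_port and the 2-second per-attempt timeout are illustrative choices, not part of any product.

```shell
#!/bin/sh
# Sketch: measure how long it takes for a TCP port to become reachable.
# probe_once and wait_for_port are hypothetical helper names.

# Try one TCP connection; succeed only if the port accepts it within 2 seconds.
# Assumes an nc build that supports -z (connect only) and -w (timeout).
probe_once() {
  nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# wait_for_port HOST PORT DEADLINE_SECONDS
# Retry probe_once, pausing 1 second between attempts, until it succeeds
# or DEADLINE_SECONDS have elapsed since the first attempt started.
wait_for_port() {
  host=$1
  port=$2
  deadline=$3
  start=$(date +%s)
  while :; do
    if probe_once "$host" "$port"; then
      echo "reachable after $(( $(date +%s) - start ))s"
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$deadline" ]; then
      echo "not reachable within ${deadline}s"
      return 1
    fi
    sleep 1
  done
}
```

For example, running `wait_for_port credhub.service.cf.internal 8844 45` inside a freshly started container reports whether the port became reachable within the 45-second window. A result close to or beyond 45 seconds would be consistent with the slow TEP activation scenario, although it does not prove it; a firewall drop would produce the same output.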



Additional Information

  • A similar scenario, whose effect resembles the delay in activating the TEP interface, was also fixed in NSX-T 2.5.1. The scenario and the workaround that was used are as follows:

    • In this scenario, the CredHub VM receives the TCP SYN from the container and responds with a SYN-ACK that never reaches the container. The container then retransmits the SYN to the CredHub VM. However, when CredHub responds to the retransmitted SYN with another SYN-ACK, it uses a different TCP sequence number because of a DDoS prevention feature in the Ubuntu OS.
    • When the NSX distributed firewall (DFW) is enabled, every host in an NSX cluster maintains a connection table that tracks the connections to and from every VM's vNIC. When the DFW intercepts a packet, it creates an entry for that packet in the DFW connection state table.
    • Because the Ubuntu OS changed the sequence number, the DFW on the host intercepts the return traffic, finds that the sequence number of the new SYN-ACK does not match the state recorded for the connection, and drops the packet. This is the expected behavior of the DFW and other stateful firewalls.
    • To work around this, we added the CredHub VMs to the DFW exclusion list.
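
To check whether a packet capture shows the changed SYN-ACK sequence number described above, the text output of tcpdump can be scanned for it. The following is a minimal sketch under stated assumptions: tcpdump is available, the capture is printed in tcpdump's usual text format ("Flags [S.], seq N, ..."), and find_changed_synack is a hypothetical helper name. It exits 0 if a change was found and 1 otherwise, like grep.

```shell
#!/bin/sh
# Sketch: flag flows whose SYN-ACK sequence number changes between retransmissions.
# find_changed_synack is a hypothetical helper; it reads tcpdump text output on stdin.
find_changed_synack() {
  awk '
    # Only SYN-ACK packets, which tcpdump prints as "Flags [S.]".
    /Flags \[S\.\]/ {
      src = ""; dst = ""; seq = ""
      for (i = 2; i < NF; i++) {
        # The first ">" separates "src.port" from "dst.port:".
        if ($i == ">" && src == "") { src = $(i - 1); dst = $(i + 1); sub(/:$/, "", dst) }
        # The token after "seq" is the sequence number, followed by a comma.
        if ($i == "seq")            { seq = $(i + 1); sub(/,$/, "", seq) }
      }
      if (src == "" || seq == "") next
      flow = src " > " dst
      if (!(flow in first)) {
        first[flow] = seq            # remember the first SYN-ACK seen for this flow
      } else if (first[flow] != seq) {
        print "SYN-ACK sequence number changed for " flow ": " first[flow] " -> " seq
        changed = 1
      }
    }
    END { exit (changed ? 0 : 1) }
  '
}
```

One possible use, assuming a capture was saved on the CredHub VM to a file named capture.pcap (a placeholder name): `tcpdump -nn -S -r capture.pcap 'tcp port 8844' | find_changed_synack`. A reused source port on a later, unrelated connection would also be reported, so any match should be confirmed against the timestamps in the capture.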