PKS Cluster creation fails with "Error: Timed out pinging to node_id after 600 seconds"

Article ID: 323917

Updated On:

Products

VMware

Issue/Introduction

Symptoms:
  • You are unable to create a PKS cluster.
  • The BOSH agents on VMs created on the PKS service network cannot reach (ping) the BOSH Director.
  • The bosh vms command reports an unresponsive agent for the service instance.
  • The /var/vcap/sys/log/nats/nats.log file on the BOSH Director VM contains messages similar to the following, which show connections between the nodes and the BOSH Director timing out with TLS errors:
[7] 2020/10/30 14:36:56.475284 [ERR] 10.100.168.67:39409 - cid:397 - TLS handshake timeout
[7] 2020/10/30 14:36:56.660401 [ERR] 10.100.168.10:52862 - cid:398 - TLS handshake error: read tcp 10.100.100.182:4222->10.100.168.10:52862: i/o timeout
[7] 2020/10/30 14:36:56.660783 [ERR] 10.100.168.10:52862 - cid:398 - TLS handshake timeout
[7] 2020/10/30 14:37:00.016826 [ERR] 10.100.168.25:62462 - cid:399 - TLS handshake error: read tcp 10.100.100.182:4222->10.100.168.25:62462: i/o timeout

Note: 10.100.100.182 is the BOSH Director IP address.
  • When deploying a new cluster via PKS, the deployment fails with an error similar to:

Task 321 | 18:31:45 | Creating missing vms: worker/a5bd0919-a64b-4657-9ca0-dec8958aa57d (0) (00:10:49) 
L Error: Timed out pinging to 7eded48e-8a64-4582-8d80-7d128765b381 after 600 seconds 
Task 321 | 18:31:46 | Error: Timed out pinging to 692ea345-5302-400d-96f1-2aa00a60d089 after 600 seconds

 

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values may differ in your environment.
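
When nats.log contains many entries, it can help to count TLS handshake failures per client IP to see which nodes are affected. The following is a minimal sketch, assuming log lines in the same format as the excerpt above. The function name tls_failures_by_client and the regular expression are illustrative and are not part of BOSH or PKS.

```python
import re
from collections import Counter

# Matches lines such as:
# [7] 2020/10/30 14:36:56.475284 [ERR] 10.100.168.67:39409 - cid:397 - TLS handshake timeout
# Assumption: the format matches the example excerpt; other NATS versions may differ.
TLS_ERR = re.compile(
    r"\[ERR\]\s+(\d{1,3}(?:\.\d{1,3}){3}):\d+\s+-\s+cid:\d+\s+-\s+TLS handshake"
)


def tls_failures_by_client(lines):
    """Count TLS handshake error/timeout lines per client IP address."""
    counts = Counter()
    for line in lines:
        match = TLS_ERR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The four example lines from the excerpt above.
sample = [
    "[7] 2020/10/30 14:36:56.475284 [ERR] 10.100.168.67:39409 - cid:397 - TLS handshake timeout",
    "[7] 2020/10/30 14:36:56.660401 [ERR] 10.100.168.10:52862 - cid:398 - TLS handshake error: read tcp 10.100.100.182:4222->10.100.168.10:52862: i/o timeout",
    "[7] 2020/10/30 14:36:56.660783 [ERR] 10.100.168.10:52862 - cid:398 - TLS handshake timeout",
    "[7] 2020/10/30 14:37:00.016826 [ERR] 10.100.168.25:62462 - cid:399 - TLS handshake error: read tcp 10.100.100.182:4222->10.100.168.25:62462: i/o timeout",
]

for ip, count in sorted(tls_failures_by_client(sample).items()):
    print(ip, count)
```

To run it against a real log, pass an open file object instead of the sample list, for example open("/var/vcap/sys/log/nats/nats.log").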


Environment

VMware Pivotal Container Service 1.x

Resolution

This issue can occur if the BOSH agent on the Kubernetes nodes is not reachable from the BOSH Director or NSX-T Manager. One possible cause is that jumbo frames are not configured correctly within the environment.

To resolve this issue, ensure that an MTU of 1600 is set on the dvSwitch used by PKS. The uplink profile created in NSX-T Manager should also use an MTU of 1600.

Follow the troubleshooting steps below to identify the root cause.
Run ping tests across the environment with the -s (packet size) option.
  1. From an instance VM (located in the services network or another network where the Director VM is not placed), ping the BOSH Director using ping -d -s 1490 ipaddress
    1. If this ping succeeds, the MTU is set correctly across the networks. Move on to step 2.
    2. If this ping fails, the MTU on one of the networks is set incorrectly and needs to be changed to 1600.
  2. From the host, ping the Edge management IP.
    • If this ping fails, the physical network is not enabled for jumbo frames. The customer needs to enable jumbo frames on their physical network.
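
The value passed to -s is the ICMP payload size, not the size of the full packet on the wire. The sketch below works through the arithmetic for the 1490-byte test, assuming IPv4 with no IP options (20-byte header) and an 8-byte ICMP echo header. The function names are illustrative only.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def ip_packet_size(icmp_payload_bytes):
    """Total IP packet size produced by a ping with the given payload size."""
    return icmp_payload_bytes + ICMP_HEADER_BYTES + IPV4_HEADER_BYTES


def fits_in_mtu(icmp_payload_bytes, mtu):
    """True if the resulting IP packet fits in the MTU without fragmentation."""
    return ip_packet_size(icmp_payload_bytes) <= mtu


print(ip_packet_size(1490))     # 1518
print(fits_in_mtu(1490, 1500))  # False
print(fits_in_mtu(1490, 1600))  # True
```

Under these assumptions, a 1490-byte payload yields a 1518-byte IP packet, which exceeds a standard 1500-byte MTU but fits within 1600. Whether an oversized packet is dropped or fragmented depends on the ping options and platform in use.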

Additionally, the Director and the nodes communicate through NATS, a lightweight publish-subscribe messaging system. NATS is the messaging bus that the BOSH Director uses to communicate with the BOSH agents running on all BOSH-deployed VMs. When BOSH creates a new VM in vSphere, it waits for the VM to boot and communicate back to the NATS server running on the Director on port 4222. If the VM does not communicate back within 10 minutes (600 seconds), this error is reported and BOSH deletes the VM. For more information, see Components of BOSH.

For NATS to work, the BOSH agent on the Kubernetes node VM must be able to connect to the BOSH Director (server) on TCP port 4222. For more information on firewall requirements for VMware PKS, see Firewall Ports and Protocols Requirements.
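
A basic TCP reachability check for port 4222 can be sketched as follows, assuming Python is available on a VM in the same network as the node. The function name can_reach and the 5-second timeout are illustrative choices, not part of BOSH or PKS.

```python
import socket

NATS_PORT = 4222  # NATS port on the BOSH Director, as described above


def can_reach(host, port=NATS_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, can_reach("10.100.100.182") would test the Director address used in the log excerpt. A successful TCP connection only shows that the port is open through any firewalls. It does not by itself rule out an MTU problem, because larger packets sent later in the TLS handshake may still be dropped.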