L Error: Timed out pinging VM 'vm-guid' with agent 'agent-id' after 600 seconds


Article ID: 298625


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Prerequisites

  • Infrastructure is vSphere with NSX-T
  • The NSX-T deployment topology for TKGI is Hybrid Topology. In a hybrid topology, the PKS Management Network is on a routable subnet, while the Kubernetes Nodes Network uses a non-routable subnet (NAT mode is checked in the PKS tile).
Example (Hybrid Topology): [topology diagram not included]

Description

  • Creating a TKGI (PKS) cluster or running the Smoke Test errand fails with the following errors:
Task 492 | 00:04:26 | Creating missing vms: worker/83750a61-3f1d-41ee-95d1-8205ed776848 (1) (00:11:13)
    L Error: Timed out pinging VM 'vm-dd50f8cc-3791-4a72-9434-cb4b55613f88' with agent '92015bf4-11c7-4cf5-9401-82c47a4efaaa' after 600 seconds

Task 492 | 00:04:32 | Creating missing vms: worker/0f85c4af-e310-4e52-8521-edecc8e16787 (0) (00:11:19)
    L Error: Timed out pinging VM 'vm-31caa2c6-d907-4cf7-859c-09e7d1e63489' with agent 'b912302a-0330-475d-a826-e52022ff2fe1' after 600 seconds

Task 492 | 00:04:34 | Creating missing vms: master/51466e97-093e-40ab-848b-2f5259b243e1 (0) (00:11:21)
    L Error: Timed out pinging VM 'vm-bec9263c-0cdc-47ff-a3ae-fe14e6a5311e' with agent '76f4cad8-035b-4cd1-a3ea-accec8bc46cd' after 600 seconds
  • This error occurs when the BOSH agent on a newly deployed VM cannot communicate with the NATS process (listening on port 4222) running on the BOSH Director VM.
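
When a task fails on several VMs at once, it can help to pull the instance name, VM CID, and agent ID out of the task output so the affected VMs can be looked up in vCenter. The following is a minimal sketch, assuming the task output has been saved as text in the format shown above; the function name find_ping_timeouts is hypothetical and not part of the BOSH CLI.

```python
import re

# Matches one "Creating missing vms ... Timed out pinging VM" entry.
# DOTALL lets the match span a line break between the task line and the error.
PING_TIMEOUT = re.compile(
    r"Creating missing vms: (?P<instance>\S+) \((?P<index>\d+)\).*?"
    r"Timed out pinging VM '(?P<vm_cid>[^']+)' "
    r"with agent '(?P<agent_id>[^']+)' after (?P<seconds>\d+) seconds",
    re.DOTALL,
)


def find_ping_timeouts(task_output: str) -> list[dict[str, str]]:
    """Return one dict per ping-timeout error found in the task output."""
    return [m.groupdict() for m in PING_TIMEOUT.finditer(task_output)]
```

Each returned dict carries the keys instance, index, vm_cid, agent_id, and seconds as strings.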


Environment

Product Version: 1.7

Resolution

Note: There are multiple scenarios in which the remote BOSH agent cannot communicate with NATS, for example an IaaS network configuration issue, a duplicate IP address on the network, or a corrupted stemcell build.
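
To narrow down whether the cause is a network path problem rather than one of the other scenarios, a plain TCP probe against the NATS port can be run from a machine attached to the Kubernetes Nodes Network. This is a minimal sketch under that assumption; the function name can_reach_nats is hypothetical, and the BOSH Director IP must be supplied by the caller. A successful connection only shows that port 4222 is reachable, not that the agent can authenticate to NATS.

```python
import socket


def can_reach_nats(director_ip: str, port: int = 4222, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to director_ip:port can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((director_ip, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If the probe fails from the node subnet but succeeds from the management subnet, the NAT and routing configuration between the two is the likely place to look.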

This article addresses the case where the error occurs in a vSphere with NSX-T environment (using Hybrid Topology) because of incorrect or missing NAT rules on the T0 router, specifically when no SNAT rule is set up to translate non-routable IPs (the PKS cluster) to routable IPs (the PKS Management Plane, which includes the BOSH Director, Ops Manager, PKS API, and DB VMs).

In a Hybrid NSX-T topology, errors like those shown in the Issue section above appear when source NAT fails to translate the non-routable IPs (usually the subnet where the TKGI/PKS cluster is deployed) to routable IPs (usually the subnet where the Management Plane is deployed: Ops Manager, BOSH, PKS/TKGI, etc.). To resolve this issue, verify the SNAT rules configured on the T0 router, then correct any incorrect rule or add the missing one.

The following example shows how to set up the SNAT rule on the T0 router:
  • PKS Management Plane CIDR is 10.40.14.0/24 (routable IPs)
  • CIDR used for a TKGI/PKS cluster is 172.31.0.0/24 (non-routable IPs)
The SNAT rule that allows the non-routable IPs to communicate with the routable IPs translates source 172.31.0.0/24 to 10.40.14.40, where 10.40.14.40 is a routable IP that can access the PKS Management Plane VMs.
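
The verification step can be sketched as a small check over a list of NAT rules exported from the T0 router. This is an illustration only: the function name snat_covers is hypothetical, and the dictionary keys (action, match_source_network, translated_network, enabled) are assumptions modeled on NSX-T NAT rule fields, so adjust them to match the actual export.

```python
import ipaddress


def snat_covers(rules, source_cidr, translated_ip):
    """Return True if an enabled SNAT rule translates all of source_cidr
    to translated_ip."""
    source = ipaddress.ip_network(source_cidr)
    for rule in rules:
        # Skip rules that are not SNAT or are disabled.
        if rule.get("action") != "SNAT" or not rule.get("enabled", True):
            continue
        match = ipaddress.ip_network(rule["match_source_network"], strict=False)
        # The rule must cover the whole cluster subnet and use the expected IP.
        if source.subnet_of(match) and rule.get("translated_network") == translated_ip:
            return True
    return False


# Values taken from the example above.
rules = [
    {
        "action": "SNAT",
        "match_source_network": "172.31.0.0/24",
        "translated_network": "10.40.14.40",
        "enabled": True,
    }
]
print(snat_covers(rules, "172.31.0.0/24", "10.40.14.40"))  # prints True
```

A result of False for the cluster CIDR suggests that the SNAT rule is missing, disabled, or pointing at a different translated IP.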