Increasing Timeouts for Cloud Director Container Service Extension to resolve slow Tanzu Kubernetes Grid cluster creation errors in non-production environments

Products

VMware Cloud Director

Issue/Introduction

Attempting to create a Tanzu Kubernetes Grid cluster fails in Cloud Director.
The UI may show 'VcdKeServerError' as the error type, with additional error information such as:

"Watched worker thread [threadId] exited for RDE [clusterName(clusterId)]"
"Watched worker thread [threadId]: reached to unexpected location for RDE [clusterName(clusterId)]"
The CSE server log, /root/cse.log, may also contain errors of the form:

"Unexpected location reached."
"Watched worker thread [threadId] exited for RDE [clusterName(clusterId)]"

Environment

VMware Cloud Director 10.x
VMware Container Service Extension 4.x

Cause

This issue can occur if the Cloud Director network to which the Tanzu Kubernetes Grid cluster VMs are deployed is slow or if Tanzu Kubernetes Grid cluster VM creation in Cloud Director is taking a long time to complete.

By default, Container Service Extension expects to receive a response from Cloud Director within 3-4 minutes.
If VM creation takes longer than this, Container Service Extension receives no updates until VM creation completes, and the worker thread exits because of the timeout.
Container Service Extension then marks the cluster as being in an error state during creation and updates the Cloud Director UI to show the failed cluster creation.

Resolution

To resolve the issue, ensure that there is no network slowness on the Cloud Director network to which the Tanzu Kubernetes Grid cluster VMs are deployed and that Tanzu Kubernetes Grid cluster VM creation in Cloud Director completes quickly.

If the environment has slow download speeds from the Broadcom container registry, projects.packages.broadcom.com, consider using a local container registry as described in the documentation, Set up a Local Container Registry in an Air-gapped Environment.

In non-production scenarios where this cannot be achieved, the timeouts can be altered as a workaround using the steps detailed below.

Workaround:
To work around the issue, use a REST client or the Cloud Director API Explorer to edit the VCDKEConfig settings with updated timeout values.

WARNING: The workaround detailed below is only for testing and non-production purposes.
Increasing these timeouts can have further effects on the operation of CSE and is not supported in production environments.
The supported solution is to resolve the network and VM creation performance issues.

Log in as a System Administrator to the Cloud Director instance using a REST API client such as Postman or curl. Use the value of the X-VMWARE-VCLOUD-ACCESS-TOKEN header returned by the login response as the Authorization Bearer token in the subsequent API calls. For information on logging in to the Cloud Director API, see the documentation, Logging In.
Get the VCDKEConfig ID for this Cloud Director instance using the Cloud Director API:

   Request:
   GET https://vcloud.example.com/cloudapi/1.0.0/entities/types/vmware/VCDKEConfig/1

   Request Headers:
   Accept: application/json;version=37.1
   Authorization: Bearer {token}

   Note the id value from the JSON response which will be in a format like the following:
   urn:vcloud:entity:vmware:VCDKEConfig:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Get the current VCDKEConfig settings using the VCDKEConfig ID retrieved above:

   Request:
   GET https://vcloud.example.com/cloudapi/1.0.0/entities/urn:vcloud:entity:vmware:VCDKEConfig:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

   Request Headers:
   Accept: application/json;version=37.1
   Authorization: Bearer {token}
Copy the entire response body from step 3 above and locate the serverConfig field in the body.
By default it should look like this:

   "serverConfig": {
      "rdePollIntervalInMin": 1,
      "staleHeartbeatIntervalInMin": 0,
      "heartbeatWatcherTimeoutInMin": 0
   }
Change the staleHeartbeatIntervalInMin and heartbeatWatcherTimeoutInMin settings to the desired values in the JSON:

   Original:
   ...
   "serverConfig": {
      "rdePollIntervalInMin": 1,
      "staleHeartbeatIntervalInMin": 0,
      "heartbeatWatcherTimeoutInMin": 0
   },
   ...

   Updated:
   ...
   "serverConfig": {
      "rdePollIntervalInMin": 1,
      "staleHeartbeatIntervalInMin": 20,
      "heartbeatWatcherTimeoutInMin": 60
   },
   ...
Send the entire JSON configuration, including the timeout changes, back in a PUT request to update the configuration:

   Request:
   PUT https://vcloud.example.com/cloudapi/1.0.0/entities/urn:vcloud:entity:vmware:VCDKEConfig:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

   Request Headers:
   Accept: application/json;version=37.1
   Content-Type: application/json
   Authorization: Bearer {token}

   Request Body:
   Full updated JSON edited in step 5.
Confirm the changes have been made by getting the VCDKEConfig settings again:

   Request:
   GET https://vcloud.example.com/cloudapi/1.0.0/entities/urn:vcloud:entity:vmware:VCDKEConfig:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

   Request Headers:
   Accept: application/json;version=37.1
   Authorization: Bearer {token}
Finally, restart the Container Service Extension service on each CSE server so that it uses the new configuration. Do this by logging in to each CSE server VM and running:

systemctl restart cse
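
The GET, edit, PUT, and confirm steps above could also be scripted. The following is a minimal Python sketch, not a supported tool. It assumes a bearer token and the VCDKEConfig ID have already been obtained as described in the first two steps. All function names are hypothetical. Because this article does not show where serverConfig sits inside the full response body, the sketch searches the whole body for it rather than assuming a path.

```python
import copy
import json
import urllib.request

API_VERSION = "37.1"  # API version used in this article's request headers


def build_headers(token, with_body=False):
    """Request headers as listed in the steps above."""
    headers = {
        "Accept": f"application/json;version={API_VERSION}",
        "Authorization": f"Bearer {token}",
    }
    if with_body:
        headers["Content-Type"] = "application/json"
    return headers


def find_server_config(node):
    """Depth-first search for the first "serverConfig" object in the body.

    Assumption: the nesting path is not given in the article, so search for it.
    """
    if isinstance(node, dict):
        if isinstance(node.get("serverConfig"), dict):
            return node["serverConfig"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_server_config(child)
        if found is not None:
            return found
    return None


def with_timeouts(entity, stale_min=20, watcher_min=60):
    """Return a copy of the full body with only the two timeouts changed."""
    updated = copy.deepcopy(entity)
    server_config = find_server_config(updated)
    if server_config is None:
        raise KeyError("serverConfig not found in the VCDKEConfig body")
    server_config["staleHeartbeatIntervalInMin"] = stale_min
    server_config["heartbeatWatcherTimeoutInMin"] = watcher_min
    return updated


def call(method, url, token, body=None):
    """Send one request and return the decoded JSON response, if any."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers=build_headers(token, with_body=body is not None),
    )
    with urllib.request.urlopen(request) as response:
        raw = response.read()
    return json.loads(raw) if raw else None


def update_vcdke_timeouts(base_url, token, config_id):
    """GET the current settings, PUT the edited body, then GET to confirm."""
    url = f"{base_url}/cloudapi/1.0.0/entities/{config_id}"
    current = call("GET", url, token)
    call("PUT", url, token, with_timeouts(current))
    return find_server_config(call("GET", url, token))
```

A call such as update_vcdke_timeouts("https://vcloud.example.com", token, config_id) would return the serverConfig values read back after the update. The CSE service restart described above is still required afterwards, and the same non-production restriction applies.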

Additional Information

WARNING:
Changing the timeout settings will alter how Container Service Extension behaves and is not supported in production environments.

Increasing staleHeartbeatIntervalInMin to a value of 20 makes Container Service Extension take longer to re-process clusters.
For example, after a cluster creation has failed, it will take 20 minutes before Container Service Extension attempts a repair.
Likewise, if a cluster has been successfully created and a cluster delete is issued right after, Container Service Extension will only begin deletion after 20 minutes.
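
The delay is simple arithmetic on the configured interval. A minimal sketch, assuming only what this article states (work on a cluster is picked up staleHeartbeatIntervalInMin minutes after the triggering event); the function name is hypothetical and this is not CSE code:

```python
from datetime import datetime, timedelta


def earliest_processing_time(event_time, stale_heartbeat_interval_min):
    """Earliest time the cluster would be re-processed after an event."""
    return event_time + timedelta(minutes=stale_heartbeat_interval_min)


# Hypothetical example: a cluster creation fails at 10:00 with the interval set to 20.
failed_at = datetime(2024, 1, 1, 10, 0)
print(earliest_processing_time(failed_at, 20))  # 2024-01-01 10:20:00
```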