VcdKeServerError
' as the error type, which may contain additional error info such as:
"Watched worker thread [threadId] exited for RDE [clusterName(clusterId)]"
"Watched worker thread [threadId]: reached to unexpected location for RDE [clusterName(clusterId)]"
/root/cse.log may also contain errors of the form:
"Unexpected location reached."
"Watched worker thread [threadId] exited for RDE [clusterName(clusterId)]"
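The messages quoted above can be searched for in a copy of the log. The following is a minimal Python sketch, not an official tool: the function name find_timeout_errors is hypothetical, and the patterns assume that the bracketed parts of the messages ([threadId], [clusterName(clusterId)]) are placeholders for arbitrary text. The real log layout may differ.

```python
import re
from typing import Iterable, List

# Assumed message shapes, taken from the error strings quoted in this article.
# The bracketed placeholders are matched as arbitrary non-empty text.
TIMEOUT_PATTERNS = [
    re.compile(r"Watched worker thread .+ exited for RDE .+"),
    re.compile(r"Watched worker thread .+ reached to unexpected location for RDE .+"),
    re.compile(r"Unexpected location reached"),
]


def find_timeout_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any of the assumed timeout messages."""
    hits = []
    for line in lines:
        if any(pattern.search(line) for pattern in TIMEOUT_PATTERNS):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, pass any iterable of lines, for example an open file handle on /root/cse.log, and print the returned list.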
This issue can occur if the Cloud Director network to which the Tanzu Kubernetes Grid cluster VMs are deployed is slow or if Tanzu Kubernetes Grid cluster VM creation in Cloud Director is taking a long time to complete.
By default, Container Service Extension expects to receive a response from Cloud Director within an estimated 3-4 minutes.
If VM creation takes longer than this estimate, Container Service Extension receives no updates until VM creation completes, and the worker thread exits because of the timeout.
Container Service Extension then marks the cluster as being in an error state during creation and updates the Cloud Director UI to show the failed cluster creation.
To resolve the issue, ensure that there is no network slowness on the Cloud Director network to which the Tanzu Kubernetes Grid cluster VMs are deployed and that Tanzu Kubernetes Grid cluster VM creation in Cloud Director completes quickly.
If the environment has slow download speeds from the Broadcom container registry, projects.packages.broadcom.com, consider using a local container registry as described in the documentation, Set up a Local Container Registry in an Air-gapped Environment.
In non-production scenarios where this cannot be achieved, the timeouts can be altered as a workaround using the steps detailed below.
Workaround:
To work around the issue, use a REST client or the Cloud Director API Explorer to edit the VCDKEConfig settings with updated timeout values.
WARNING: The workaround detailed below is only for testing and non-production purposes.
Increasing these timeouts can have further effects on the operation of CSE and is not supported in production environments.
The supported solution is to resolve the network and VM creation performance issues.
WARNING:
Changing the timeout settings will alter how Container Service Extension behaves and is not supported in production environments.
Increasing staleHeartbeatIntervalInMin to a value of 20 will make Container Service Extension take longer to re-process clusters.
For example, after a cluster creation has failed, it will take 20 minutes before Container Service Extension attempts a repair.
Likewise, if a cluster has been successfully created and a cluster delete is issued right after, Container Service Extension will only begin deletion after 20 minutes.
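As an illustration of the edit itself, the following Python sketch changes staleHeartbeatIntervalInMin in a VCDKEConfig document that has already been retrieved as JSON with a REST client. It is a sketch under stated assumptions: the helper names are hypothetical, and because the exact nesting of the setting inside the document is not described here, the code searches the whole structure for the key instead of assuming a path. It makes no API calls; the updated document would still have to be sent back with the REST client or the API Explorer.

```python
import copy
from typing import Any


def set_key_everywhere(doc: Any, key: str, value: Any) -> int:
    """Set every occurrence of `key` in nested dicts/lists; return how many were set."""
    changed = 0
    if isinstance(doc, dict):
        for name in doc:
            if name == key:
                doc[name] = value
                changed += 1
            else:
                changed += set_key_everywhere(doc[name], key, value)
    elif isinstance(doc, list):
        for item in doc:
            changed += set_key_everywhere(item, key, value)
    return changed


def with_stale_heartbeat(config: dict, minutes: int) -> dict:
    """Return a copy of `config` with staleHeartbeatIntervalInMin set to `minutes`."""
    if minutes <= 0:
        raise ValueError("minutes must be a positive integer")
    updated = copy.deepcopy(config)  # leave the caller's document untouched
    if set_key_everywhere(updated, "staleHeartbeatIntervalInMin", minutes) == 0:
        raise KeyError("staleHeartbeatIntervalInMin not found in the document")
    return updated
```

Working on a copy keeps the originally retrieved document available, which makes it easier to restore the previous value after testing.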