A provisioned cluster can enter an error state due to a known issue.
If the Auto Repair on Errors feature is enabled on that cluster, the cluster can be deleted and recreated, which disrupts the workloads running on it.
This article helps providers identify which clusters may be affected by this issue and update the cluster definitions to avoid it.
The auto-repair flag was added to Container Service Extension (CSE) to retry cluster creation when temporary errors (e.g., timeouts) occur. However, the functionality is not disabled once the cluster reaches the Available state, so a later error on a provisioned cluster can still trigger a delete and recreate.
VMware Cloud Director 10.x
This issue is resolved in Container Service Extension 4.1.1.
If you are unable to upgrade, use the detect-cluster-autorepair.sh script to identify which clusters have the auto-repair flag enabled. After identifying the affected clusters, visit the settings page for each cluster to disable this setting.
# REQUIRED
export VCD_URL= # https://vcd.cloud.local/api
export VCD_USER= # administrator
export VCD_PASSWORD=
# OPTIONAL
export https_proxy= # 10.2.3.4:3128
<org_name> # Print usage for this organization
-A,--all-orgs # Iterate over all organizations and print usage
-k,--insecure # https://curl.se/docs/manpage.html#-k
--cacert path # https://curl.se/docs/manpage.html#--cacert
--capath path # https://curl.se/docs/manpage.html#--capath
--debug # Print all commands to the console. Warning: this will expose passwords and API tokens.
-h
-v,--version
Note: Add --cacert /path/to/ca-certificates.pem if VCD uses self-signed certificates. Alternatively, use -k to skip certificate validation.
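Because the script relies on the three REQUIRED variables listed above, it can help to confirm they are set before running it. The following is a minimal sketch of such a preflight check. The function name check_vcd_env is hypothetical and is not part of detect-cluster-autorepair.sh; only the variable names come from the usage text above.

```shell
# Hypothetical preflight helper (not part of detect-cluster-autorepair.sh).
# Reports each required variable that is unset or empty and returns
# non-zero if any of them is missing.
check_vcd_env() {
  missing=0
  [ -n "${VCD_URL:-}" ]      || { echo "VCD_URL is not set" >&2; missing=1; }
  [ -n "${VCD_USER:-}" ]     || { echo "VCD_USER is not set" >&2; missing=1; }
  [ -n "${VCD_PASSWORD:-}" ] || { echo "VCD_PASSWORD is not set" >&2; missing=1; }
  return "$missing"
}
```

One possible use is to chain it in front of the report command, for example check_vcd_env && ./detect-cluster-autorepair.sh -A, so the script only runs when the environment is complete.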
Execute ./detect-cluster-autorepair.sh -A to print a report of all clusters and a PASS/WARN/FAIL result based on the auto-repair flag.
Example Output
<Flag State> ... <Org Name>/<Cluster Name> - <Reason>
PASS ... solutions/harbor
PASS ... solutions/development
WARN ... alpha/development - The auto-repair flag is enabled and the cluster is in error state.
PASS ... alpha/banking
FAIL ... alpha/services
FAIL ... bravo/development - The auto-repair flag is enabled and the cluster is provisioned.
PASS ... bravo/banking
PASS ... bravo/services
Clusters that return a FAIL result should be updated immediately to disable the auto-repair flag. Clusters that return a WARN result should be evaluated to determine if changes are necessary.
These steps may be taken by the cluster author or a system user with appropriate privileges.
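When the report covers many organizations, it may be convenient to list only the clusters that need attention. The sketch below filters a report in the format shown in the example output, keeping the FAIL and WARN entries. The function name triage_report is hypothetical. It assumes that each result line starts with the flag state, followed by a "..." token and then <Org Name>/<Cluster Name>, and that organization and cluster names contain no spaces.

```shell
# Hypothetical helper: read a report on stdin and print only the FAIL and
# WARN entries as "<result> <org>/<cluster>".
# Assumes whitespace-separated fields: result, "...", org/cluster, then an
# optional reason. PASS lines and the header line are skipped.
triage_report() {
  awk '$1 == "FAIL" || $1 == "WARN" { print $1, $3 }'
}
```

Assuming the script writes its report to standard output, this could be used as ./detect-cluster-autorepair.sh -A | triage_report. For the example output above, the filter would keep alpha/development (WARN) and alpha/services and bravo/development (FAIL).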