Detect Auto-Repair on errors for all Container Service Extension Clusters

Products

VMware Cloud Director

Issue/Introduction

It is possible that a provisioned cluster can go into an error state due to a known issue.

If the Auto Repair on Errors feature is activated on the cluster, that cluster can get deleted and recreated, which causes disruption of workloads on that cluster.
This article will help providers identify the scope of clusters that may be affected by this issue and update the cluster definitions to avoid it.

The auto-repair flag was added to Container Service Extension (CSE) to retry cluster creation when temporary errors (e.g.; timeouts) occur. The functionality is not disabled when the cluster reaches the Available state.

Environment

VMware Cloud Director 10.x

Resolution

This issue is resolved in Container Service Extension 4.1.1.

If you are unable to upgrade, use the detect-cluster-autorepair.sh script to identify which clusters have the auto-repair flag enabled. After identifying the affected clusters, visit the settings page for each cluster to disable this setting.

Environment Variables

# REQUIRED
export VCD_URL=           # https://vcd.cloud.local/api
export VCD_USER=          # administrator
export VCD_PASSWORD=

# OPTIONAL
export https_proxy=       # 10.2.3.4:3128

CLI Options

<org_name>             # Print usage for this organization
-A,--all-orgs          # Iterate over all organizations and print usage

-k,--insecure          # https://curl.se/docs/manpage.html#-k
--cacert path          # https://curl.se/docs/manpage.html#--cacert
--capath path          # https://curl.se/docs/manpage.html#--capath

--debug                # Print all commands to the console. Warning: this will expose passwords and API tokens.

-h
-v,--version

Note: Add --cacert /path/to/ca-certificates.pem if you are using self-signed certificates for VCD. You may alternatively use -k if you want to skip certificate validation.

Execution

Execute ./detect-cluster-autorepair.sh -A to print a report of all clusters and a PASS/WARN/FAIL result based on the auto-repair flag.

PASS = entity.spec.vcdKe.autoRepairOnErrors is not set on the Cluster.
WARN = entity.spec.vcdKe.autoRepairOnErrors is set on a Cluster that is in provisioning or error state.
FAIL = entity.spec.vcdKe.autoRepairOnErrors is set on a provisioned Cluster.

Example Output
<Flag State> . . . <Org Name>/<Cluster Name> - <Reason>

PASS ... solutions/harbor
PASS ... solutions/development
WARN ... alpha/development - The auto-repair flag is enabled and the cluster is in error state.
PASS ... alpha/banking
FAIL ... alpha/services
FAIL ... bravo/development - The auto-repair flag is enabled and the cluster is provisioned.
PASS ... bravo/banking
PASS ... bravo/services

Correction

Clusters that return a FAIL result should be updated immediately to disable the auto-repair flag. Clusters that return a WARN result should be evaluated to determine if changes are necessary.

These steps may be taken by the cluster author or a system user with appropriate privileges.

Open the VMware Cloud Director UI.
Browse to More -> Kubernetes Container Clusters.
Click on the name of the affected Cluster.
Click on Settings.
Disable the Auto Repair on Errors option.
Click Save.
Repeat these steps for all affected Clusters.
Execute the script again to confirm the changes are reflected in the results.

Attachments

detect-cluster-autorepair get_app