Best Practice: Shutting Down or Pausing TKGm Management and Workload Clusters During Physical Network Switch Maintenance

Article ID: 422039

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

During planned physical network switch maintenance, customers often ask whether it is acceptable to keep Tanzu Kubernetes Grid (Multi-Cloud, TKGm) Management and Workload Cluster nodes running while other virtual machines remain online, or whether these Kubernetes nodes should be shut down.

This article provides best-practice recommendations to prevent etcd corruption, control-plane instability, and cluster reconciliation issues following network outages.

Environment

VMware Tanzu Kubernetes Grid

Cause

Unlike regular virtual machines, TKGm cluster nodes host critical Kubernetes components, such as the control plane and etcd, that are highly sensitive to network interruptions.
If these nodes remain powered on during a network outage, loss of communication among etcd members or control-plane nodes can lead to data corruption, reconciliation delays, or cluster instability after the network is restored.
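
As a point of reference, control-plane and etcd health can be verified from the affected cluster's context before the maintenance window begins. The commands below are a minimal sketch; they assume kubectl is pointed at the cluster in question and that etcd runs as kubeadm-style static pods, which is the TKGm default.

   # Verify that all nodes report Ready before the maintenance window.
   kubectl get nodes -o wide

   # Inspect the etcd and control-plane static pods in kube-system
   # (the tier=control-plane label is a kubeadm default).
   kubectl -n kube-system get pods -l tier=control-plane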

Resolution

Recommendation:

It is strongly recommended to gracefully shut down all TKGm Management and Workload Cluster nodes before initiating physical network switch maintenance.
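
The documents linked under Additional Information describe the supported shutdown procedure. The sketch below only illustrates the general shape of a graceful worker-node shutdown, using a hypothetical node name and assuming SSH access with the default capv node user.

   # Evict workloads from the node before powering it off
   # (the node name and IP are hypothetical placeholders).
   kubectl drain tkg-workload-md-0-xxxxx --ignore-daemonsets --delete-emptydir-data

   # Gracefully shut down the guest OS over SSH.
   ssh capv@<node-ip> 'sudo shutdown -h now'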

If shutting down is not feasible, an alternative approach is to pause the clusters prior to maintenance and unpause them after the network has been fully restored.

Pause/Unpause Procedure:

1. Check the cluster's pause status (no output is returned if the cluster is not paused):

   kubectl -n $NAMESPACE get cluster ${CLUSTER} -o jsonpath='{.spec.paused}' | jq .

2. Pause the cluster:

   kubectl -n $NAMESPACE patch cluster ${CLUSTER} --type merge -p '{"spec":{"paused": true}}'

3. Unpause the cluster:

   kubectl -n $NAMESPACE patch cluster ${CLUSTER} --type merge -p '{"spec":{"paused": false}}'
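
If several workload clusters live in the same namespace, the same patch can be applied in a loop. This is a minimal sketch that assumes kubectl is pointed at the management cluster context and that $NAMESPACE is set as in the steps above.

   # Pause every Cluster API cluster object in the namespace.
   for c in $(kubectl -n $NAMESPACE get clusters -o name); do
     kubectl -n $NAMESPACE patch "$c" --type merge -p '{"spec":{"paused": true}}'
   done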

Recommended Sequence:

Pause sequence:

  1. Management Cluster

  2. Management Nodes

  3. Workload Clusters

  4. Worker Nodes

Unpause sequence:

  1. Workload Clusters

  2. Worker Nodes

  3. Management Cluster

  4. Management Nodes
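
After unpausing, reconciliation can be confirmed from the management cluster context. The check below is a minimal sketch using the same variables as the procedure above.

   # Confirm that the paused flag is cleared and that clusters
   # report a healthy phase once reconciliation resumes.
   kubectl -n $NAMESPACE get cluster ${CLUSTER} -o jsonpath='{.spec.paused}{"\n"}'
   kubectl get clusters -A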

Note (network consideration): Ensure that the DHCP lease duration exceeds the maintenance window so that node IP addresses are retained after recovery.
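
To confirm that addresses were retained once the network is restored, node IPs can be compared against a pre-maintenance capture. This is a minimal sketch to be run against each cluster's context.

   # Record INTERNAL-IP for each node before maintenance...
   kubectl get nodes -o wide > nodes-before.txt

   # ...then compare the same output after recovery.
   kubectl get nodes -o wide | diff nodes-before.txt -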

Additional Information

For official documentation and detailed procedures, refer to:
TKG 2.5 – Shut Down and Restart Clusters

TKG 2.5 – Cluster Lifecycle Operations