Supervisor Cluster Upgrade Hangs on Spherelet VIB Update

Products

VMware vSphere Kubernetes Service

Issue/Introduction

When upgrading a Supervisor cluster, the process may hang while updating the Spherelet VIBs on the ESXi hosts acting as worker nodes. In this state, spherelet.log on the ESXi hosts shows timeout errors when attempting to reach the Supervisor Control Plane (CP) node via the Floating IP (FIP) on port 6443:

<YYYY-MM-DD>T<Time>Z No(13) spherelet[93740888]: I0115 <Time> 3740869 trace.go:219] Trace[2040300852]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:145 (<DD-MMM-YYYY> <Time>) (total time: 30001ms):
<YYYY-MM-DD>T<Time>Z No(13) spherelet[93740888]: Trace[2040300852]: ---"Objects listed" error:Get "https://<Supervisor_FIP>:6443/apis/storage.k8s.io/v1/volumeattachments?limit=500&resourceVersion=0": dial tcp <Supervisor_FIP>:6443: i/o timeout 30001ms (<Time>)
<YYYY-MM-DD>T<Time>Z No(13) spherelet[93740888]: Trace[2040300852]: [30.001228s] [30.001228s] END
<YYYY-MM-DD>T<Time>Z No(13) spherelet[93740888]: E0115 <Time> 3740869 reflector.go:148] k8s.io/client-go/informers/factory.go:145: Failed to watch *v1.VolumeAttachment: failed to list *v1.VolumeAttachment: Get "https://<Supervisor_FIP>:6443/apis/storage.k8s.io/v1/volumeattachments?limit=500&resourceVersion=0": dial tcp <Supervisor_FIP>:6443: i/o timeout
<YYYY-MM-DD>T<Time>Z No(13) spherelet[93740888]: W0115 <Time> 3740869 reflector.go:533] k8s.io/client-go/informers/factory.go:145: failed to list *v1.Service: Get "https://<Supervisor_FIP>:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp <Supervisor_FIP>:6443: i/o timeout
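To confirm that a host shows this symptom, it can help to count how many log lines report a dial timeout to port 6443. The following is a minimal sketch, not a supported tool: the function name `count_fip_timeouts` is hypothetical, and the log location used in the usage note is an assumption that may differ on your hosts.

```shell
# Count log lines that report a dial timeout to port 6443.
# $1: path to a spherelet log file (the location on the host is an assumption).
count_fip_timeouts() {
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c 'dial tcp .*:6443: i/o timeout' "$1"
}
```

For example, `count_fip_timeouts /var/log/spherelet.log` prints the number of matching lines. A count that keeps growing while the upgrade is stalled is consistent with the symptom described above.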

Cause

The rp_filter (Reverse Path Filter) mode on the Control Plane VM (CPVM) is set to strict instead of loose. Because the CPVM has a subnet configured that includes the ESXi host IP ranges, strict mode expects packets from those hosts to arrive on the interface that routes to that subnet (e.g., eth1). When the packets arrive on the management interface (eth0) instead, the CPVM drops them and never sends the TCP ACK, so the connection from the ESXi host times out.
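On Linux, the rp_filter sysctl takes the values 0 (off), 1 (strict) and 2 (loose). As a rough way to inspect the current mode on the CPVM, the sketch below maps the value to its name. The function names are hypothetical, and the interface names are only the examples used above.

```shell
# Translate a Linux rp_filter sysctl value into its mode name.
rp_filter_mode() {
    case "$1" in
        0) echo "off" ;;
        1) echo "strict" ;;
        2) echo "loose" ;;
        *) echo "unknown" ;;
    esac
}

# Print the rp_filter mode for one interface (intended to be run on the CPVM).
# $1: interface name, e.g. eth0, or "all" for the global setting.
show_rp_filter() {
    iface="$1"
    value=$(sysctl -n "net.ipv4.conf.${iface}.rp_filter") || return 1
    echo "${iface}: $(rp_filter_mode "$value")"
}
```

For example, `show_rp_filter eth0` and `show_rp_filter all`. The kernel applies the higher of the `all` and per-interface values, so both are worth checking.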

Resolution

To resolve this issue, set the rp_filter mode to loose in the WCP service configuration on the vCenter Server. Always create a backup of configuration files before making manual edits.

Log in to the vCenter Server via SSH or console.
Navigate to the configuration directory: cd /etc/vmware/wcp/
Back up the existing configuration file: cp wcpsvc.yaml wcpsvc.yaml.bak
Open the wcpsvc.yaml file using a text editor (such as vi): vi wcpsvc.yaml
Locate the following configuration section:
```
rp_filter_config:
  is_loose: false
```
Change the value from false to true:
```
rp_filter_config:
  is_loose: true
```
Save the file and exit the editor (:wq).
Restart the WCP service to apply the changes: service-control --restart vmware-vcap-wcp
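The backup and edit steps above can also be sketched as a small shell function for those who prefer not to edit the file by hand. This is an illustration under assumptions, not a supported procedure: the function name `set_rp_filter_loose` is hypothetical, it assumes a GNU `sed` that supports `-i`, and it assumes the `is_loose` key follows `rp_filter_config:` as in the snippet above.

```shell
# Back up a wcpsvc.yaml-style file and set rp_filter_config.is_loose to true.
# $1: path to the file (defaults to the path used in the steps above).
set_rp_filter_loose() {
    cfg="${1:-/etc/vmware/wcp/wcpsvc.yaml}"

    # Keep a copy of the original next to the file before changing it.
    cp -p "$cfg" "${cfg}.bak" || return 1

    # Restrict the substitution to the rp_filter_config block so that any
    # other is_loose key elsewhere in the file is left untouched.
    sed -i '/^[[:space:]]*rp_filter_config:/,/is_loose:/ s/is_loose: *false/is_loose: true/' "$cfg" || return 1

    # Show the resulting block for a visual check.
    grep -A1 'rp_filter_config:' "$cfg"
}

# After verifying the output, restart the WCP service as in the final step:
#   service-control --restart vmware-vcap-wcp
```

Calling `set_rp_filter_loose` with no argument operates on /etc/vmware/wcp/wcpsvc.yaml. Passing a path to a copy of the file first is a safer way to try it.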