SSP 5.1.1 Upgrade Fails at Upgrade Coordinator Deployment Due to Filesystem Corruption on Worker Nodes
Article ID: 433300


Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

The SSP 5.1.1 upgrade fails during its first step, Upgrade Coordinator deployment, stalling at approximately 9% completion and progressing no further. Worker nodes exhibit filesystem errors and memory pressure that prevent the Upgrade Coordinator pod from initializing successfully.

The following errors are observed on the worker node console:

EXT4-fs (sde): VFS: Can't find ext4 filesystem

Additional grsec messages may appear, such as:

grsec: [spark-app-infra-classifier-pyspark-...] denied RWX mmap of <anonymous mapping> by /usr/bin/python3.10

OOM Killer events:

[15997609.463424] Memory cgroup out of memory: Killed process 4064067 (python3) total-vm:4716132kB, anon-rss:1559796kB...
[15997609.466367] Memory cgroup out of memory: Killed process 32484 (runc:[2:INIT]) total-vm:1091780kB...
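To confirm whether a worker node is affected, its kernel log can be checked for these two signatures. A minimal sketch, assuming shell access to the node (the exact access method is environment-specific):

```shell
# Hypothetical check: scan the kernel ring buffer for the two failure
# signatures described in this article.
PATTERN="Can't find ext4 filesystem|Memory cgroup out of memory"

if command -v dmesg >/dev/null 2>&1; then
  # -T prints human-readable timestamps where supported.
  dmesg -T 2>/dev/null | grep -E "$PATTERN" || echo "no matching kernel errors"
fi
```

OOM kills reported by the kubelet can also surface as cluster events (for example via `kubectl get events -A --field-selector reason=OOMKilling`), which avoids logging in to each node individually.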

Environment

  • Upgrading from SSP 5.1.0 to SSP 5.1.1
  • Kubernetes worker nodes with CRI-O container runtime
  • CNS (Container Native Storage) volumes in use
  • MinIO and Kafka-based workloads running on the cluster

Cause

The customer had a pre-existing CNS volume issue which was recovered prior to the upgrade attempt. However, the recovery left worker nodes in an unstable state, resulting in:

  • Stale or corrupt ext4 filesystem mount state on block devices (e.g., /dev/sde), causing the kernel to report VFS: Can't find ext4 filesystem errors.
  • Memory pressure on worker nodes leading to OOM (Out-of-Memory) kills, targeting Python and container init processes critical to the upgrade workflow.
  • The combination of storage and memory instability prevented the Upgrade Coordinator pod from successfully initializing, stalling the upgrade at ~9%.

Resolution

Perform a rolling restart of all worker nodes to clear stale filesystem state and release memory pressure. After the restart, retry the SSP 5.1.1 upgrade.

Steps

  • Identify all worker nodes in the cluster.
  • Restart each worker node one at a time (rolling restart). Allow each node to fully rejoin the cluster and reach Ready status before proceeding to the next. For Cluster API-managed clusters, the worker MachineDeployment can be restarted with:
clusterctl alpha rollout restart machinedeployment/<md-name> -n <namespace>
  • Verify each restarted node shows Ready status in kubectl before continuing.
  • Once all worker nodes are healthy, re-initiate the SSP 5.1.1 upgrade.
  • Monitor the upgrade progress. The upgrade should advance past the 9% Upgrade Coordinator deployment step and continue to completion.
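The readiness gate between restarts can be scripted. A minimal sketch, assuming standard `kubectl get nodes` tabular output; `all_nodes_ready` is a hypothetical helper, not part of SSP:

```shell
# Hypothetical helper: succeed only when every non-header line of
# `kubectl get nodes` output reports STATUS exactly "Ready".
all_nodes_ready() {
  awk 'NR > 1 && $2 != "Ready" { bad = 1 } END { exit bad }'
}

# Usage during the rolling restart (requires cluster access):
#   clusterctl alpha rollout restart machinedeployment/<md-name> -n <namespace>
#   kubectl get nodes | all_nodes_ready && echo "all nodes Ready"
```

A node showing `Ready,SchedulingDisabled` (still cordoned) is deliberately treated as not ready here, since each node must fully rejoin the cluster before the next restart begins.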

Expected Outcome

After the rolling restart of worker nodes, the VFS ext4 errors and OOM conditions are resolved. The SSP 5.1.1 upgrade resumes and progresses successfully past the Upgrade Coordinator deployment step.