The SSP 5.1.1 upgrade fails during its first step, the Upgrade Coordinator deployment, and stalls at approximately 9% completion. Worker nodes exhibit filesystem errors and memory pressure that prevent the Upgrade Coordinator pod from initializing successfully.
During an SSP 5.1.1 upgrade, the process fails at approximately 9% while deploying the Upgrade Coordinator. The upgrade does not progress beyond this first step.
The following errors are observed on the worker node console:
EXT4-fs (sde): VFS: Can't find ext4 filesystem
Additional grsec messages may appear, such as:
grsec: [spark-app-infra-classifier-pyspark-...] denied RWX mmap of <anonymous mapping> by /usr/bin/python3.10
OOM Killer events:
[15997609.463424] Memory cgroup out of memory: Killed process 4064067 (python3) total-vm:4716132kB, anon-rss:1559796kB...
[15997609.466367] Memory cgroup out of memory: Killed process 32484 (runc:[2:INIT]) total-vm:1091780kB...
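To confirm that a worker node shows the same symptoms, its console or kernel log can be saved to a file and searched for the three signatures quoted above. The following is a minimal sketch, not a supported diagnostic tool: the function name scan_node_log and the log path are hypothetical, and the patterns are taken only from the messages shown in this article.

```shell
# Hypothetical helper: count occurrences of each symptom signature in a saved log.
# Prints one line per signature: <count><TAB><pattern>.
scan_node_log() {
  log_file="$1"
  for pattern in \
    "Can't find ext4 filesystem" \
    "grsec: .*denied RWX mmap" \
    "Memory cgroup out of memory"
  do
    # grep -c exits non-zero when the count is 0, so tolerate that case.
    count=$(grep -c -E -- "$pattern" "$log_file" || true)
    printf '%s\t%s\n' "$count" "$pattern"
  done
}

# Example usage on a worker node (assumes dmesg is readable by the current user):
#   dmesg -T > /tmp/node-console.log
#   scan_node_log /tmp/node-console.log
```

A non-zero count for the ext4 or OOM signatures suggests the node is in the unstable state described in this article; a zero count does not by itself rule it out, since older kernel messages may have been rotated out of the buffer.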
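The customer had a pre-existing CNS volume issue which was recovered prior to the upgrade attempt. However, the recovery left worker nodes in an unstable state, resulting in stale filesystem state (the ext4 VFS errors) and memory pressure (the OOM Killer events) on those nodes.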
Perform a rolling restart of all worker nodes to clear stale filesystem state and release memory pressure. After the restart, retry the SSP 5.1.1 upgrade.
Steps
clusterctl alpha rollout restart machinedeployment/<md-name> -n <namespace>
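When the cluster has more than one MachineDeployment, the restart command above has to be issued once per MachineDeployment. The sketch below only prints the commands so they can be reviewed before anything is executed. The function name print_restart_cmds is hypothetical, and the usage comment assumes kubectl is pointed at the management cluster where the MachineDeployment objects live.

```shell
# Hypothetical helper: print (do not run) one restart command per MachineDeployment.
# Usage: print_restart_cmds <namespace> <md-name> [<md-name> ...]
print_restart_cmds() {
  namespace="$1"
  shift
  for md in "$@"; do
    printf 'clusterctl alpha rollout restart machinedeployment/%s -n %s\n' "$md" "$namespace"
  done
}

# Example usage (assumption: run against the management cluster):
#   kubectl get machinedeployments -n <namespace> -o name
#   print_restart_cmds <namespace> <md-name-1> <md-name-2>
```

Printing first, rather than executing in a loop, leaves room to restart one MachineDeployment at a time and to wait for its replacement machines to become ready before moving on to the next.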
Expected Outcome
After the rolling restart of worker nodes, the VFS ext4 errors and OOM conditions are resolved. The SSP 5.1.1 upgrade resumes and progresses successfully past the Upgrade Coordinator deployment step.
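Before retrying the upgrade, it may help to check that every node has returned to a plain Ready status. The following sketch filters the output of kubectl get nodes. The function name not_ready_nodes is hypothetical, and it assumes the default column layout in which the second column is STATUS.

```shell
# Hypothetical helper: read "kubectl get nodes --no-headers" output on stdin and
# print the name of every node whose STATUS is anything other than exactly "Ready"
# (this includes NotReady and Ready,SchedulingDisabled).
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Example usage (assumption: kubectl is pointed at the cluster being upgraded):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means all listed nodes report Ready. Any node name printed is still restarting, cordoned, or unhealthy and is worth inspecting before the upgrade is retried.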