Remediate Cluster task could fail on large scale vSAN cluster
Article ID: 326728
Products
VMware vSphere ESXi
Issue/Introduction
This article provides steps to prevent remediation of a large-scale vSAN cluster from being interrupted by intermittent network issues during a cluster upgrade.
Symptoms: In a large-scale vSAN cluster (more than 16 nodes), you experience the following:
The Remediate Cluster task fails.
You see errors in the user interface similar to:
vSAN health test 'vSAN: Basic (unicast) connectivity check' reported an issue. Check the vSAN health.
and/or
vSAN health test 'vSAN: MTU check (ping with large packet size)' reported an issue. Check the vSAN health.
Environment
VMware vSphere 7.0.x
Cause
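This issue occurs because intermittent ping failures during the cluster upgrade cause the vSAN network health tests listed in the symptoms to report an issue, which fails the Remediate Cluster task.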
Resolution
To resolve this issue:
1. Silence the following two vSAN network health tests:
   a. Navigate to the "Monitor" page of the cluster and select the "vSAN - Skyline Health" section.
   b. Find "vSAN: Basic (unicast) connectivity check" under the "Network" category.
   c. Click "SILENCE ALERT", then click "YES".
   d. Repeat steps b and c for "vSAN: MTU check (ping with large packet size)".
2. Navigate back to the "Update" page, open the "Image" section, and click "REMEDIATE ALL" to proceed with the host upgrade.
3. After the remediation task completes, restore the alerts for the same two health tests:
   a. Navigate to the "Monitor" page of the cluster and select the "vSAN - Skyline Health" section.
   b. Find "vSAN: Basic (unicast) connectivity check" under the "Network" category.
   c. Click "RESTORE ALERT".
   d. Repeat steps b and c for "vSAN: MTU check (ping with large packet size)".
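The ordering of the procedure above (silence, remediate, always restore) can be sketched in Python for anyone scripting the workflow. This is a minimal illustration, not a VMware-provided tool: the `silence`, `restore`, and `remediate` callables are hypothetical placeholders that you would have to implement against your own automation, and only the two health test names are taken from this article. The point of the sketch is that the alerts are restored even if remediation fails partway through.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

# Health test names copied from this article.
CHECKS = (
    "vSAN: Basic (unicast) connectivity check",
    "vSAN: MTU check (ping with large packet size)",
)


@contextmanager
def silenced(
    checks: Sequence[str],
    silence: Callable[[str], None],   # hypothetical: silences one health test
    restore: Callable[[str], None],   # hypothetical: restores one health test
) -> Iterator[None]:
    """Silence each check, yield, then restore every check that was silenced."""
    done = []
    try:
        for check in checks:
            silence(check)
            done.append(check)
        yield
    finally:
        # Runs on success and on failure, so no alert stays silenced by accident.
        for check in done:
            restore(check)


def remediate_with_silenced_checks(
    silence: Callable[[str], None],
    restore: Callable[[str], None],
    remediate: Callable[[], object],  # hypothetical: starts "REMEDIATE ALL" and waits
    checks: Sequence[str] = CHECKS,
):
    """Run remediation while the two network health tests are silenced."""
    with silenced(checks, silence, restore):
        return remediate()


if __name__ == "__main__":
    # Dry run with stand-in callables that only record what would happen.
    log = []
    remediate_with_silenced_checks(
        silence=lambda c: log.append(("silence", c)),
        restore=lambda c: log.append(("restore", c)),
        remediate=lambda: log.append(("remediate", "REMEDIATE ALL")),
    )
    for entry in log:
        print(entry)
```

If you perform the steps manually in the vSphere Client, the same rule applies: restore both alerts after the remediation task ends, whether it succeeded or failed.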
Additional Information
Impact/Risks: Remediation of a large-scale vSAN cluster may be interrupted multiple times by intermittent network issues, and each time the user must manually intervene to continue the upgrade.