Remediate Cluster task could fail on large scale vSAN cluster

Article ID: 326728

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article provides steps to prevent remediation of a large-scale vSAN cluster from being interrupted by intermittent network issues during a cluster upgrade.

Symptoms:
In a large-scale vSAN cluster (more than 16 nodes), you experience these symptoms:
  • The Remediate Cluster task fails.
  • You see errors in the user interface similar to:

    vSAN health test 'vSAN: Basic (unicast) connectivity check' reported an issue. Check the vSAN health.

    and/or

    vSAN health test 'vSAN: MTU check (ping with large packet size)' reported an issue. Check the vSAN health.
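When triaging a failed remediation, it can help to check whether the reported errors are limited to the two health tests covered by this article. The following is a minimal Python sketch that assumes the error text follows the format quoted above. The function names `failing_tests` and `matches_this_article` are hypothetical and are not part of any VMware tool.

```python
import re

# Matches the message format quoted in this article and captures the test name.
HEALTH_TEST_ERROR = re.compile(r"vSAN health test '([^']+)' reported an issue")

# The two health tests this article addresses.
KNOWN_TESTS = {
    "vSAN: Basic (unicast) connectivity check",
    "vSAN: MTU check (ping with large packet size)",
}


def failing_tests(text):
    """Return the health test names found in the error text, in order, without duplicates."""
    names = []
    for name in HEALTH_TEST_ERROR.findall(text):
        if name not in names:
            names.append(name)
    return names


def matches_this_article(text):
    """True if at least one test is reported and every reported test is one of the two above."""
    names = failing_tests(text)
    return bool(names) and all(name in KNOWN_TESTS for name in names)
```

If `matches_this_article` returns False for the copied error text, a different health test is involved and the steps in this article may not apply.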


Environment

VMware vSphere 7.0.x

Cause

This issue occurs because intermittent ping failures during the cluster upgrade cause the vSAN network health tests to report an issue, which fails the remediation task.

Resolution

To resolve this issue:
  1. Silence the following two vSAN network health tests:

    a. Navigate to the "Monitor" page of the cluster and select the "vSAN - Skyline Health" section.
    b. Find "vSAN: Basic (unicast) connectivity check" under the "Network" category.
    c. Click "SILENCE ALERT", then click "YES".
    d. Repeat steps b and c for "vSAN: MTU check (ping with large packet size)".
     
  2. Navigate back to the "Update" page, "Image" section, and click "REMEDIATE ALL" to proceed with the host upgrade.
  3. After the remediation task completes, restore the alerts for the two health tests above:

    a. Navigate to the "Monitor" page of the cluster and select the "vSAN - Skyline Health" section.
    b. Find "vSAN: Basic (unicast) connectivity check" under the "Network" category.
    c. Click "RESTORE ALERT".
    d. Repeat steps b and c for "vSAN: MTU check (ping with large packet size)".
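If you script this procedure, the key point is that the alerts are restored even when remediation fails partway. The following is a minimal Python sketch of the silence, remediate, restore sequence described in the steps above. The callables `silence`, `restore`, and `remediate` are hypothetical placeholders for whatever mechanism you use to perform each step. They are assumptions for illustration and are not VMware API calls.

```python
from contextlib import contextmanager

# The two health tests named in this article.
TESTS_TO_SILENCE = (
    "vSAN: Basic (unicast) connectivity check",
    "vSAN: MTU check (ping with large packet size)",
)


@contextmanager
def silenced(tests, silence, restore):
    """Silence each test on entry and restore every silenced test on exit.

    `silence` and `restore` are hypothetical callables that take a test name.
    """
    done = []
    try:
        for test in tests:
            silence(test)
            done.append(test)
        yield
    finally:
        # Runs on success and on failure, so no alert is left silenced.
        for test in done:
            restore(test)


def run_remediation(remediate, silence, restore):
    """Step 1 (silence), step 2 (remediate), step 3 (restore)."""
    with silenced(TESTS_TO_SILENCE, silence, restore):
        remediate()
```

The `finally` clause is the design choice that matters here: a silenced network health test hides real connectivity problems, so it should not stay silenced longer than the upgrade itself.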


Additional Information

Impact/Risks:
Remediation of a large-scale vSAN cluster may be interrupted by intermittent network issues multiple times, and each time the user must intervene manually to continue the upgrade.