vSAN latency experienced during vSAN storage rescan on host
search cancel

vSAN latency experienced during vSAN storage rescan on host

book

Article ID: 315528

calendar_today

Updated On:

Products

VMware vSAN

Issue/Introduction

This article is intended to inform about this issue, provide a workaround, and inform of the version of ESXi where this issue is resolved.

Symptoms:
In versions of vSAN 6.7 prior to 6.7 P04 (17167734) and 7.0 prior to 7.0 U1C/P02 (17325551), a latency spike is seen on vSAN objects when a vsanmgmtd process issued storage rescan is performed. This can also occur when collecting ESXi support bundles, as well as during log collection, when a storage scan is performed to collect storage information.
  • This latency spike may reach multiple seconds of latency and persist for several minutes. This potentially impacts guest VM operations. The latency spike is random and may not occur on a regular basis, or may occur frequently with each vSAN rescan.

You will see this in the vSAN trace: 

YYYY-MM-DDTHH:MM:SS [########] [cpu##] [e3e06a84 COMP scrubComponent VMDISK] DOMTraceOpTookTooLong:####: {'op': 0x45a46151d000, 'objUuid': '########-####-####-####-############', 'offset-39': ##########, 'length-25': 65536, 'totalTimeMS': 2259, 'timeInThisPhaseMS': 8, 'opPhase': 'Wait for LSOM'}

YYYY-MM-DDTHH:MM:SS [########] [cpu##] [8ce96737 COMP scrubComponent VMDISK] DOMTraceOpTookTooLong:####: {'op': 0x45a46145dc00, 'objUuid': '########-####-####-####-############', 'offset-39': ###########, 'length-25': 65536, 'totalTimeMS': 2212, 'timeInThisPhaseMS': 8, 'opPhase': 'Wait for LSOM'}

 

BA rescan, vSAN will also rescan the disk group. In situations where the rescan is issued against the entire cluster, latency can reach multiple seconds and intermittently persist for hours.

Example of a vSAN rescan thread running in vmkernel.log (your dates, times, and disk details will differ):

YYYY-MM-DDTHH:MM:SSZ cpu70:2101085)PLOG: PLOGInitAndAnnounceMD:8473: Successfully announced VSAN MD (naa.0000000000000000:1) with UUID: ########-####-####-####-############. kt 1, en 0, enC 0.
YYYY-MM-DDTHH:MM:SSZ cpu36:2101085)PLOG: PLOGOpenDevice:4464: Disk handle open failure for device naa.58ce38ee203eb7e9:2, status:Busy
YYYY-MM-DDTHH:MM:SSZ cpu36:2101085)PLOG: PLOGInitAndAnnounceMD:8473: Successfully announced VSAN MD (naa.0000000000000000:1) with UUID: ########-####-####-####-############ kt 1, en 0, enC 0.
YYYY-MM-DDTHH:MM:SSZ cpu32:71487898)VC: 4616: Device rescan time 23353 msec (total number of devices 5)
YYYY-MM-DDTHH:MM:SSZ cpu32:71487898)VC: 4619: Filesystem probe time 14 msec (devices probed 3 of 5)
YYYY-MM-DDTHH:MM:SSZ cpu32:71487898)VC: 4621: Refresh open volume time 26644 msec



Environment

VMware vSAN 6.7.x
VMware vSAN 7.0.x

Cause

A vSAN process run into lock contention, which interferes with I/O until the first process is complete.

Resolution


For 6.7 builds, this issue is resolved in ESXi/vSAN 6.7 P04 (Build # 17167734) and above.
For 7.0 builds, this issue is resolved in ESXi/vSAN 7.0 U1C/P02 (Build # 17325551) and above.

Workaround:
Do not issue any manual HBA rescans, as this will trigger the problem. Patch to ESXi/vSAN 6.7 P04 (Build # 17167734) or above as soon as possible.

Additional Information

Impact/Risks:
This issue can have a severe impact on guest operations, depending on the length of time the latency spike persists and how high the latency reaches. Advise patching to vSAN build 6.7 P04 or above as soon as possible.