PSOD on ESXi host with NFSSched_DestroySchedQueue

Article ID: 372259

Products

VMware vSphere ESXi

Issue/Introduction

  • PSOD with "NFSSched_DestroySchedQueue" on an ESXi host
  • The following log entries appear in the vmkernel-zdump.log file:

2024-07-09T04:05:21.283Z cpu39:2916506)NFS41: NFS41SetSchedQueuePolicy:3056: Mismatch! sched worldID:2916341 worldID:2916341. schedWorld:0x4317b085a880 schedpolicy:0x4538e659bdf0
2024-07-09T04:05:21.355Z cpu39:2916506)World: 3072: PRDA 0x420049c00000 ss 0x0 ds 0x10b es 0x10b fs 0x10b gs 0x0
2024-07-09T04:05:21.355Z cpu39:2916506)World: 3074: TR 0xf58 GDT 0x45384004e000 (0xf77) IDT 0x420018950000 (0xfff)
2024-07-09T04:05:21.355Z cpu39:2916506)World: 3075: CR0 0x80010031 CR3 0x20ea3db000 CR4 0x142768
2024-07-09T04:05:21.392Z cpu39:2916506)Backtrace for current CPU #39, worldID=2916506, fp=0x4317b085a880
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bc88:[0x420018905530]MCSLockWork@vmkernel#nover+0x8 stack: 0xa4435f7, 0x431a22e01240, 0x420019ef6f95, 0x4308fac10450, 0x420019ee9e5c
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bc90:[0x420019efaa9f]NFSSched_DestroySchedQueue@(nfsclient)#<None>+0x1c stack: 0x431a22e01240, 0x420019ef6f95, 0x4308fac10450, 0x420019ee9e5c, 0x430e4ca01660
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bcb0:[0x420019ef6f94]NFSVolume_DestroySchedQHandle@(nfsclient)#<None>+0x11 stack: 0x430e4ca01660, 0x0, 0x1, 0x420018ceaccd, 0x430e4ca01aa0
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bcc0:[0x420019ee9e5b]NFSOpCloseFile@(nfsclient)#<None>+0xe0 stack: 0x1, 0x420018ceaccd, 0x430e4ca01aa0, 0x0, 0x1
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bd90:[0x42001883b564]FSSVec_CloseFile@vmkernel#nover+0x1d stack: 0x4308fac10544, 0x420018840990, 0x149c05610, 0x420000000000, 0x420049c05ab0
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bda0:[0x4200188373dd]FSS_DoCloseFile@vmkernel#nover+0x6e stack: 0x149c05610, 0x420000000000, 0x420049c05ab0, 0x0, 0xa4435f7
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bdb0:[0x42001884098f]BC_CloseFile@vmkernel#nover+0x70 stack: 0x420049c05ab0, 0x0, 0xa4435f7, 0x4308fac10450, 0x0
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659be00:[0x4200188376ca]FSS_CloseFile@vmkernel#nover+0x87 stack: 0x10, 0x4308f3dd9dd0, 0x4308fac10450, 0x45d9022da868, 0x45d9025b37b0
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659be50:[0x420018cd8dc4]UserVmfs_Close@vmkernel#nover+0x35 stack: 0xa, 0x45d9025b37b0, 0xa, 0x420018cb936c, 0x430f77002010
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659be80:[0x420018cb936b]UserObj_ReleaseWithoutCartel@vmkernel#nover+0x10 stack: 0xa, 0x420018cbb754, 0x176, 0x430f7701a7f0, 0x0
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bea0:[0x420018cbb753]UserObj_FDClose@vmkernel#nover+0x178 stack: 0x0, 0x45d9025b37b0, 0x430f77002010, 0x4538e659bf40, 0x3
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bf00:[0x420018d07e86]LinuxFileDesc_Close@vmkernel#nover+0x1b stack: 0xceb99b1c0, 0x4538e659bfd0, 0x0, 0x0, 0x0
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bf10:[0x420018cb4863]User_LinuxSyscallHandler@vmkernel#nover+0x1a4 stack: 0x0, 0x0, 0x0, 0x42001894e068, 0x10b
2024-07-09T04:05:21.392Z cpu39:2916506)0x4538e659bf40:[0x42001894e067]gate_entry@vmkernel#nover+0x68 stack: 0x0, 0x3, 0xce6abdfcd, 0xca5030310, 0xca504e708
2024-07-09T04:05:21.416Z cpu39:2916506)VMware ESXi 7.0.3 [Releasebuild-22348816 x86_64]
#PF Exception 14 in world 2916506:vmx-vcpu-0:v IP ######### addr ######

 

  • Below is the PSOD backtrace:

#0 Atomic_Read16 (var=0x9e) at bora/public/vm_atomic.h:2792
#1 MCSTryLockCommon (lock=0x9c) at bora/vmkernel/main/mcslock.c:1160
#2 MCSLockCommonInt (ra=0x0, lock=0x9c) at bora/vmkernel/main/mcslock.c:2223
#3 MCSLockWork (lock=lock@entry=0x9c) at bora/vmkernel/main/mcslock.c:2305
#4 0x0000420019efaaa0 in MCS_Lock (lock=0x9c) at bora/vmkernel/private/mcslock.h:261
#5 NFSSched_DestroySchedQueue (schedQ=0x4317b085a880) at bora/modules/vmkernel/nfsclient/nfsSched.c:2549
#6 0x0000420019ef6f95 in NFSVolume_DestroySchedQHandle (mpe=mpe@entry=0x431a22e01240, fhID=fhID@entry=172242423, schedQHandle=<optimized out>) at bora/modules/vmkernel/nfsclient/nfsVolume.c:4469
#7 0x0000420019ee9e5c in NFSOpCloseFile (file=0x4308fac10450, fhID=172242423) at bora/modules/vmkernel/nfsclient/nfsClient.c:4500
#8 0x000042001883b565 in FSSVec_CloseFile (desc=<optimized out>, fhID=<optimized out>) at bora/vmkernel/filesystems/fsSwitchVec.c:459
#9 0x00004200188373de in FSS_DoCloseFile (fileDesc=fileDesc@entry=0x4308fac10450, fhid=fhid@entry=172242423, openFlags=<optimized out>, openFlags@entry=1, failedOpen=failedOpen@entry=0 '\000') at bora/vmkernel/filesystems/fsSwitch.c:4052
#10 0x0000420018840990 in BC_CloseFile (desc=0x4308fac10450, fhid=172242423, openFlags=1, failedOpen=<optimized out>) at bora/vmkernel/filesystems/caches/bufferCache2.c:3194
#11 0x00004200188376cb in FSSCloseFile (failedOpen=0 '\000', openFlags=<optimized out>, fhid=172242423, fileDesc=0x4308fac10450) at bora/vmkernel/filesystems/fsSwitch.c:4258
#12 FSS_CloseFile (fileHandleID=172242423) at bora/vmkernel/filesystems/fsSwitch.c:4259
#13 0x0000420018cd8dc5 in UserVmfs_Close (obj=0x45d9025b37b0) at bora/vmkernel/user/userVmfs.c:2091
#14 0x0000420018cb936c in UserObj_ReleaseWithoutCartel (obj=0x45d9025b37b0) at bora/vmkernel/user/userObj.c:2141
#15 0x0000420018cbb754 in UserObj_ReleaseWithoutCartel (obj=<optimized out>) at bora/vmkernel/user/userObj.c:4591
#16 UserObj_Release (obj=<optimized out>, uci=0x430f77002010) at bora/vmkernel/user/userObj.c:2115
#17 UserObj_FDClose (uci=0x430f77002010, fd=<optimized out>) at bora/vmkernel/user/userObj.c:4606
#18 0x0000420018d07e87 in LinuxFileDesc_Close (fd=<optimized out>) at bora/vmkernel/user/linuxFileDesc.c:1081
#19 0x0000420018cb4864 in User_LinuxSyscallHandler (fullFrame=0x4538e659bf40) at bora/vmkernel/user/user.c:2057
#20 0x000042001894e068 in gate_entry ()
#21 0x0000000ce6abdfcd in ?? ()

Environment

ESXi 7.0.x

Cause

VMware by Broadcom is aware of this issue and is working on a fix.

Resolution

  • The reported issue will be fixed in ESXi 7.0.3 P10.

 

  • As a workaround, perform the steps below on all ESXi hosts that use NFS 4.1 to disable the nfs41 client's file-based scheduler:

 

SSH to the ESXi host and run this command:

# esxcli system module parameters set -m nfs41client -p fileBasedScheduler=0


Then run this command to verify that the file-based scheduler is disabled:


# esxcli system module parameters list -m nfs41client


Name                Type  Value  Description
------------------  ----  -----  -----------
fileBasedScheduler  bool  0      Enable/Disable file based scheduler for NFSv41 (default: 1)

 

Then reboot the ESXi host.
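
The reboot can also be issued from the same SSH session. As a minimal sketch (assuming running VMs have already been evacuated or powered off, since esxcli will only reboot a host that is in maintenance mode):

# esxcli system maintenanceMode set -e true
# esxcli system shutdown reboot -r "Disable NFS 4.1 file-based scheduler"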


Perform this action on all hosts with NFS 4.1 mounted datastores; a quick way to check for such datastores is shown below.
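
To check whether a given host has any NFS 4.1 datastores mounted (and therefore needs this workaround), list the NFS 4.1 mounts from the ESXi shell; if the command returns no volumes, the host can be skipped:

# esxcli storage nfs41 list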

 

Additional Information

What is the impact of disabling the file-based scheduler (FBS)?


With FBS, I/O policies (IOPS limits, throughput, etc.) can be set per VMDK of a VM.

Without FBS, policies apply per VM, per datastore. For example, consider a VM with vmdk1 and vmdk2 on datastore NFS1, and vmdk3 and vmdk4 on datastore NFS2, and assume a policy of 100 IOPS is set per VMDK.

  • With FBS: NFS allows only 100 IOPS for each of the above VMDKs (per VM, per VMDK).
  • Without FBS: vmdk1 and vmdk2 get a cumulative 200 IOPS; it can happen that vmdk1 gets 150 IOPS while vmdk2 gets 50 IOPS. The same applies to vmdk3 and vmdk4.

If no such policies are set on your VMDKs, this change has no impact.
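
If such policies are in use and you want to restore the default behavior later (for example, after upgrading to a build that contains the fix), the parameter can presumably be set back to its default of 1, per the parameter description above, followed by another reboot:

# esxcli system module parameters set -m nfs41client -p fileBasedScheduler=1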