ESXi Host PSOD or Virtual Machine crash after VMs vMotion due to a race condition

Article ID: 318710

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

ESXi hosts running 7.0 U2, 7.0 U3, or 8.0 GA may experience a Purple Screen of Death (PSOD) with a backtrace similar to one of the following:

A)
   [0x4200331b864b]VmMemCowPShareRemoveWithCheck@vmkernel#nover+0xe7
   [0x4200331be07b]VmMemRemap_Page@vmkernel#nover+0x4e4
   [0x420033160bff]AsyncRemapProcessVMRemapList@vmkernel#nover+0x2a8
   [0x4200331610fd]AsyncRemapProcessRemapListWrapper@vmkernel#nover+0x2d2
   [0x42003311423a]VmAssistantProcessTasks@vmkernel#nover+0x14b
   [0x4200333b1871]CpuSched_StartWorld@vmkernel#nover+0x86
   [0x4200330c4a5f]Debug_IsInitialized@vmkernel#nover+0xc

B)
   [0x4180205c8fc4]VmMemZipAllocPage@vmkernel#nover+0x21c
   [0x4180205ca8dc]VmMemZip_RemapMpn@vmkernel#nover+0x3a1
   [0x4180205b4c4c]VmMemRemap_RemapZipMPN@vmkernel#nover+0x29
   [0x418020570038]LPageSelectLPageToDefrag@vmkernel#nover+0x389
   [0x4180205212d6]VmAssistantProcessTasks@vmkernel#nover+0x13f
   [0x418020710dda]CpuSched_StartWorld@vmkernel#nover+0x77

C)
   [0x4200235beb45]VmMemZipListRemove@vmkernel#nover+0x75
   [0x4200235bebcf]VmMemZipRemoveFromFreeList@vmkernel#nover+0x34
   [0x4200235c0349]VmMemZip_MarkMpnForRemapIfZipMpn@vmkernel#nover+0x15e
   [0x4200235a804a]VmMemRemap_RemapZipMPN@vmkernel#nover+0xf
   [0x420023554f17]AsyncRemap_AddOrRemapVM@vmkernel#nover+0xa4
   [0x42002355d9d0]LPageSelectLPageToDefrag@vmkernel#nover+0x40d
   [0x4200235086d8]VmAssistantProcessTasks@vmkernel#nover+0x13d
   [0x420023769481]CpuSched_StartWorld@vmkernel#nover+0x82

D)
   [0x4200109bf0eb]VmMemZipAllocPage@vmkernel#nover+0x20f
   [0x4200109bf5f1]VmMemZip_CompressPage@vmkernel#nover+0xb6
   [0x4200109a96e9]VmMemIOTryToCompress@vmkernel#nover+0xae
   [0x4200109a9c0c]VmMemIOFilterSwapPage@vmkernel#nover+0x13d
   [0x4200109aafe4]VmMemIO_SelectCandidatePages@vmkernel#nover+0x191
   [0x420010982672]SwapSelectCandidatePages@vmkernel#nover+0x57
   [0x420010982ec4]SwapVMKSwapOrRetry@vmkernel#nover+0x341
   [0x4200108d1e91]HelperQueueFunc@vmkernel#nover+0x29e
   [0x420010b69481]CpuSched_StartWorld@vmkernel#nover+0x82
   [0x4200108be69f]Debug_IsInitialized@vmkernel#nover+0xc

Virtual machine crash:

  1. A VM may panic with the message "Compressed page has an invalid length for BPN" and/or a backtrace like the one below. The panic is followed by an ESXi Purple Screen of Death with a backtrace similar to one of those listed further down.

Virtual machine backtrace:
    [0x420038d31e2d]WorldPanicWork@vmkernel#nover+0x8d
    [0x420038d320dd]World_Panic@vmkernel#nover+0x142
    [0x420038dbbbe8]VmMemIO_FaultCompressedPage@vmkernel#nover+0x21d
    [0x420038dcab88]VmMemPfCompressed@vmkernel#nover+0x5d
    [0x420038dcbbaa]VmMemPfInt@vmkernel#nover+0x3ab
    [0x420038dcc1fe]VmMemPf@vmkernel#nover+0x87

Note: The virtual machine backtrace can be found from the ESXi command line in the /var/run/log directory.
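
As a rough sketch, the panic message quoted above can be searched for from the ESXi shell with grep. The helper name find_bpn_panic is hypothetical, and because the note does not say which log file holds the message, the sketch simply scans every file in the directory:

```shell
# List the log files in a directory that contain the VM panic message.
# find_bpn_panic is an illustrative helper name, not a VMware tool.
find_bpn_panic() {
    log_dir="${1:-/var/run/log}"
    # -l prints only the names of matching files; -F treats the message as a
    # fixed string. Errors for subdirectories or unreadable files are hidden.
    grep -lF "Compressed page has an invalid length for BPN" "$log_dir"/* 2>/dev/null
}

# Example invocation on the host:
#   find_bpn_panic /var/run/log
```

Printing file names only keeps the output short; running grep without -l on a reported file shows the full message and its timestamp.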

ESXi Purple Screen of Death:
    [0x420038d74768]PFrame_LookupSafe@vmkernel#nover+0x34
    [0x420038d748b9]PFrame_GetSafe@vmkernel#nover+0xe
    [0x420038d5e6f9]AllocDeallocInt@vmkernel#nover+0x2d2
    [0x420038d5e7f8]Alloc_Dealloc@vmkernel#nover+0x21
    [0x420038fb3b37]MemSched_VMMWorldCleanup@vmkernel#nover+0x4c
    [0x420038cde2c2]InitTable_Cleanup@vmkernel#nover+0x27
    [0x420038d36a70]World_TryReap@vmkernel#nover+0x385
    [0x420038d00687]ReaperWorkerWorld@vmkernel#nover+0xd8
    [0x420038f9b689]CpuSched_StartWorld@vmkernel#nover+0x86
    [0x420038cc5b57]Debug_IsInitialized@vmkernel#nover

    OR
    [0x42001f1bd949]VmMemZipListRemove@vmkernel#nover+0x75
    [0x42001f1bd9d3]VmMemZipRemoveFromFreeList@vmkernel#nover+0x34
    [0x42001f1bdcca]VmMemZipFreePage@vmkernel#nover+0x1f7
    [0x42001f1bed62]VmMemZip_FreePage@vmkernel#nover+0x3f
    [0x42001f167b4e]PFrame_Dealloc@vmkernel#nover+0x103
    [0x42001f151ca1]AllocDeallocInt@vmkernel#nover+0x2e2
    [0x42001f151d84]Alloc_Dealloc@vmkernel#nover+0x21
    [0x42001f37e67b]MemSched_VMMWorldCleanup@vmkernel#nover+0x4c
    [0x42001f0d52aa]InitTable_Cleanup@vmkernel#nover+0x27
    [0x42001f12b7e2]World_TryReap@vmkernel#nover+0x327

One or more virtual machines (the crashing VM or any other) were migrated to the impacted ESXi host. To confirm this, check /var/run/log/vmkernel.log on the ESXi host for messages similar to the following:

2022-12-09T06:15:43.129Z cpu25:2113460)Migrate: 312: vmotion: Dest vmmLeaderID = 2113461, ts = 2685151509724111192, srcIP = <xx.xx.xx.xx> dstIP = <xx.xx.xx.xx> Dest wid = 0 using SHARED swap, encrypted
2022-12-09T06:15:51.279Z cpu20:2113461)Migrate: 102: 2685151509724111192 D: MigrateState: Complete
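
The check above could be scripted along these lines. This is a minimal sketch: list_incoming_vmotions is a hypothetical helper name, and the pattern assumes the log lines keep the same shape as the two sample messages (the numeric source-line field after "Migrate:" may differ between builds, so it is matched loosely):

```shell
# Print vMotion destination-side messages from a vmkernel log.
# list_incoming_vmotions is an illustrative helper name.
list_incoming_vmotions() {
    log_file="${1:-/var/run/log/vmkernel.log}"
    # Match the two message shapes shown in the sample: the "vmotion: Dest"
    # line and the destination-side "MigrateState: Complete" line.
    grep -E 'Migrate: [0-9]+: (vmotion: Dest |[0-9]+ D: MigrateState: Complete)' "$log_file"
}

# Example invocation on the host:
#   list_incoming_vmotions /var/run/log/vmkernel.log
```

Comparing the timestamps of the matched lines with the time of the crash indicates whether a migration to the host preceded it.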


Environment

VMware vSphere ESXi 7.0.3
VMware vSphere ESXi 8.0.0
VMware vSphere ESXi 7.0.2

Cause

  • The issue affects ESXi 7.0 U2, 7.0 U3, and 8.0 GA.
  • It is triggered when a virtual machine is migrated (vMotion) to an overcommitted ESXi host, or when the migrated VM has a sched.mem.max setting that leads to swapping on the ESXi host.

Recent performance optimizations for vMotion introduced a race condition between pre-validation and swapping that may lead to memory corruption. The corruption can manifest as different types of crashes, including VMkernel PSODs and guest BSODs. The backtraces listed above are common examples, not an exhaustive list.

Resolution



Workaround:
To work around this issue, disable the sched.mem.migPreval.enable attribute on all virtual machines. The attribute is enabled by default.

Note: Once the issue has been encountered, reboot the ESXi host, then disable sched.mem.migPreval.enable before migrating any VMs to the host.

To disable the value manually on one virtual machine:

  1. Ensure that the virtual machine is shut down and powered off.
  2. Right-click the virtual machine and click Edit Settings.
  3. Click the VM Options tab.
  4. Select Advanced.
  5. Click the Edit Configuration button.
  6. From the Configuration Parameters window, click Add Configuration Params.
  7. In the Name and Value fields, enter the parameter name and value as below:
    sched.mem.migPreval.enable = "FALSE"
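
To double-check the result from the ESXi shell, the VM's .vmx file can be inspected, since advanced configuration parameters are stored there. The sketch below assumes the parameter is written exactly as entered in the steps above; migpreval_status is a hypothetical helper name, and the path in the example is a placeholder:

```shell
# Report whether a .vmx file carries the workaround setting.
# Prints "disabled" when the parameter is present and set to "FALSE",
# and "not disabled" otherwise (missing, or set to any other value).
migpreval_status() {
    vmx_file="$1"
    if grep -Eiq '^[[:space:]]*sched\.mem\.migPreval\.enable[[:space:]]*=[[:space:]]*"FALSE"' "$vmx_file"; then
        echo "disabled"
    else
        echo "not disabled"
    fi
}

# Example invocation (placeholder path):
#   migpreval_status /vmfs/volumes/<datastore>/<vm>/<vm>.vmx
```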

To disable the value on all virtual machines on the ESXi host:

  1. Connect to the ESXi host command line as the root user over SSH.
  2. Navigate to /etc/vmware/ using the cd command: # cd /etc/vmware
  3. Back up the config file using the command: # cp config config.bak
  4. Open the config file in a text editor using the command: # vi config
  5. Press the i key to enter Insert mode.
  6. Add the line sched.mem.migPreval.enable = "FALSE" at the end of the file.
  7. Press the ESC key to exit Insert mode.
  8. Type :wq and press Enter to save and exit.
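
The editor steps above can also be done without vi. The following is a sketch only, not an official procedure: disable_migpreval_in_config is a hypothetical helper name, and the check that avoids duplicate entries is an addition that the manual steps do not include:

```shell
# Back up a host config file and append the workaround line if no entry for
# the parameter exists yet. disable_migpreval_in_config is an illustrative name.
disable_migpreval_in_config() {
    config_file="${1:-/etc/vmware/config}"
    setting='sched.mem.migPreval.enable = "FALSE"'

    # Leave the file alone when the parameter is already present, so repeated
    # runs neither duplicate the line nor overwrite the earlier backup.
    if grep -q '^[[:space:]]*sched\.mem\.migPreval\.enable' "$config_file"; then
        echo "entry already present in $config_file; review it manually"
        return 0
    fi

    cp "$config_file" "$config_file.bak" || return 1

    # If the file does not end with a newline, add one first so the new
    # setting starts on its own line.
    if [ -n "$(tail -c 1 "$config_file")" ]; then
        printf '\n' >> "$config_file"
    fi
    printf '%s\n' "$setting" >> "$config_file"
}

# Example invocation on the host:
#   disable_migpreval_in_config /etc/vmware/config
```

Taking the backup only when a change is actually made means config.bak always holds the file as it was before the workaround was applied.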

Alternatively, ensure that the destination host is not overcommitted during migration and that the VM being migrated does not have sched.mem.max set to a non-default value less than memSize.
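
The sched.mem.max condition can be checked per VM by comparing the two values in its .vmx file. This is a minimal sketch under stated assumptions: check_mem_max is a hypothetical helper name, both values are assumed to be plain integers in the same unit, and any sched.mem.max value that is not a non-negative integer is treated as "no limit set":

```shell
# Warn when a .vmx file sets sched.mem.max below memSize.
# check_mem_max is an illustrative helper name.
check_mem_max() {
    vmx_file="$1"
    awk -F'"' '
        # Normalise the key: lower-case it and strip spaces and the "=" sign.
        { key = tolower($1); gsub(/[[:space:]=]/, "", key) }
        key == "memsize"       { mem = $2 }
        key == "sched.mem.max" { max = $2 }
        END {
            # Only compare when both values are plain non-negative integers.
            if (max ~ /^[0-9]+$/ && mem ~ /^[0-9]+$/ && max + 0 < mem + 0)
                print "sched.mem.max (" max ") is below memSize (" mem ")"
            else
                print "ok"
        }
    ' "$vmx_file"
}

# Example invocation (placeholder path):
#   check_mem_max /vmfs/volumes/<datastore>/<vm>/<vm>.vmx
```

A VM that prints a warning would be a candidate for the workaround before migration.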