PSOD is observed on NSX prepared ESXi hosts

Article ID: 320289

Updated On:

Products

VMware NSX
VMware vSphere ESXi

Issue/Introduction

  • The ESXi host experiences a PSOD with the following backtrace:

    @BlueScreen: #PF Exception 14 in world 4002061:NetWorld-VM-IP 0x42002#####19 addr 0x3c
    Code start: 0x4200#####00 VMK uptime: ######
    0x4539#####78:[0x42002#####19]PktList_SplitByUplinkPort@vmkernel#nover+0x9 stack: 0x0
    0x4539c#####80:[0x42002#####17]PktList_IOCompleteLocked@vmkernel#nover+0xdc stack: 0x0
    0x4539c#####f0:[0x42002#####ba]Portset_ProcessAllDeferred@vmkernel#nover+0x2b stack: 0x600##2e
    0x4539c#####10:[0x42002#####ac]Port_ReleaseNonexcl@vmkernel#nover+0x13d stack: 0x1
    0x4539c#####50:[0x42002#####5f]NetWorldPerVMCB@vmkernel#nover+0x1a8 stack: 0x4301#####90
    0x4539c#####e0:[0x42002#####ca]CpuSched_StartWorld@vmkernel#nover+0x7b stack: 0x0
    0x4539c#####00:[0x42002#####8b]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
  • The NSX latency feature is configured on pNICs whose Geneve offload is limited to a maximum Geneve option length of less than 40 bytes, such as the QLogic FastLinQ QL41xxx.
  • Host transport nodes are configured with an IPv6 TEP, but the pNIC supports inner checksum and TCP segmentation offload only with IPv4 as the outer header, not IPv6. Examples include the QLogic FastLinQ QL41xxx and the Intel Fortville X710/XXV710/XL710.
  • The following core files are present under /var/core:
    net-vdr-zdump.#  vmkernel-zdump.#
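
When triaging a PSOD, it can help to check mechanically whether the top frames of a captured backtrace match the one shown above. The following is a minimal sketch, not a supported tool: the function name matches_signature and the choice of the three innermost frames as the signature are assumptions made for illustration.

```python
import re

# Innermost frames of the backtrace shown in this article, in order.
# Using only these three frames as the signature is an assumption.
SIGNATURE = (
    "PktList_SplitByUplinkPort",
    "PktList_IOCompleteLocked",
    "Portset_ProcessAllDeferred",
)

# A frame line looks like "0x...:[0x...]Function@vmkernel#nover+0x9 stack: 0x0".
FRAME_RE = re.compile(r"\](\w+)@vmkernel")


def matches_signature(backtrace: str) -> bool:
    """Return True if the first frames of the backtrace equal SIGNATURE."""
    frames = FRAME_RE.findall(backtrace)
    return tuple(frames[: len(SIGNATURE)]) == SIGNATURE
```

A match only indicates that the crash path is the same; the conditions listed under Cause still need to be confirmed for the host.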

Environment

VMware NSX
ESXi

Cause

There are several conditions that can cause this issue:

  • The pNIC supports Geneve offload with constraints.
  • Outer UDP source port randomization is requested.
  • The packet is read-only.
  • The packet exceeds the pNIC's offload constraints.

Resolution

This issue is resolved in VMware ESXi version 8.0 U1c (build 22088125) and later versions, available at Broadcom downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.
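
To check a host against the fixed release, the version string can be compared with the build number above. This is a minimal sketch under stated assumptions: the input is a string in the form printed by `vmware -v` (for example "VMware ESXi 8.0.1 build-22088125"), build numbers are assumed to increase with each release within the 8.0 line, and the function name has_fix is hypothetical.

```python
import re

FIXED_BUILD = 22088125  # ESXi 8.0 U1c, per this article

# Expected input form (assumption): "VMware ESXi 8.0.1 build-22088125"
VERSION_RE = re.compile(r"ESXi\s+(\d+)\.(\d+)(?:\.\d+)?\s+build-(\d+)")


def has_fix(version_line: str) -> bool:
    """Return True if the version string is ESXi 8.0 U1c or a later version.

    Releases older than 8.0 return False because this article does not
    list a fixed build for them, even when their build number is higher.
    """
    match = VERSION_RE.search(version_line)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_line!r}")
    major, minor, build = (int(group) for group in match.groups())
    if (major, minor) > (8, 0):
        return True
    if (major, minor) == (8, 0):
        return build >= FIXED_BUILD
    return False
```

The version check comes before the build comparison because build numbers are not comparable across release lines.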


NOTE: NSX has also developed a fix that disables outer UDP source port randomization for x86 hosts. The fix is available in NSX 4.1.1 and later.

Workaround:

  • Disable Latency Feature
    • This has the least impact but is not a complete workaround. If the issue still occurs, another component is causing packets to exceed the pNIC's offload constraints.
  • Disable Geneve Offload
    • Disabling Geneve offload on the physical NIC prevents the issue by restricting the offload constraints even further.
      • Pros:
        • It can prevent the same issue from being triggered by other offload constraints, such as IPv6 outer and inner L7 offset.
        • Features like LAT can still be enabled.
      • Cons:

Additional Information