ESXi Host PSOD Triggered by L7 DFW FQDN Rules Due to Corruption of FQDN domain_list Entries

Article ID: 425030

Products

VMware vDefend Firewall

Issue/Introduction

ESXi hosts may experience a purple screen of death (PSOD) when NSX Distributed Firewall (DFW) Layer 7 rules that use FQDN-based context profiles are configured.

  • The PSOD stack trace references VSIP, PF, and FQDN-related code paths, for example:

    (gdb) bt
    #0  memscan (c=0, no=<optimized out>, ni=<optimized out>, so=<optimized out>, si=0x************) at …/nsx_libc_memscan.h:****
    #1  strlen (s=0x************) at …/nsx_libc_string.h:****
    #2  pfa_get_counts (kif=0x************, nAttr=0x************, attrLen=0x************) at …/pf_attribute.c:****
    #3  0x**************** in pfioctl (kif=0x************, dev=<optimized out>, cmd=<optimized out>, addr=<optimized out>, flags=<optimized out>, td=<optimized out>) at …/pf_ioctl.c:****
    #4  0x**************** in VSIPConversionGetConnectionStat (kif=0x************, msg=0x************, msgLen=<optimized out>, result=0x************) at …/msg2pf.c:****
    #5  0x**************** in VSIPToPFIoctl (cookie=<optimized out>, cmd=<optimized out>, data=<optimized out>, dataLen=<optimized out>, result=0x************) at …/msg2pf.c:****
    #6  0x**************** in VSIPFlowGetRecordsForFilter (sol=<optimized out>, filter=0x************, data=0x************) at …/vsip_flow.c:****
    #7  0x**************** in VSIPDVFConfigFilterByName (fpAgentName="vmware-sfw", filterName=<optimized out>, iter=<optimized out>, data=0x************) at …/vsip_dvfilter.c:****
    #8  0x**************** in VSIPFlowGetRecords (fpAgentName="vmware-sfw", dvsPort=<optimized out>, cmdId=<optimized out>, buffer=<optimized out>, bufLen=<optimized out>, includeAttrs=1) at …/vsip_flow.c:****
    #9  0x**************** in VSIPIoctlFlowData (cmd=<optimized out>, iocData=0x************, fnData=<optimized out>, result=<optimized out>) at …/vsip_fw_ioctl.c:****
    #10 0x**************** in VSIPIoctlImpl (cmd=<optimized out>, req=0x************, result=0x************) at …/vsip_ioctl.c:****
    #11 0x**************** in VSIPCharDevIoctl (cmd=<optimized out>, userData=<optimized out>, result=0x************) at …/vsip_dev.c:****
    #12 0x**************** in VMKAPICharDevIoctl (handle=0x************, userData=<optimized out>, cmd=<optimized out>) at …/vmkapi_char.c:****
    #13 0x**************** in VMKAPICharDevDevfsWrapIoctl (handle=0x************, cmd=<optimized out>, userData=<optimized out>, ioctlResult=0x************) at …/vmkapi_char.c:****
    #14 0x**************** in CharDriverIoctl (deviceHandleID=<optimized out>, cmd=<optimized out>, dataInOut=0x************) at …/charDriver.c:****
    #15 0x**************** in FDS_Ioctl (dataInOut=0x************, cmd=FDS_IOCTL_PASS_THRU, fdsHandle=<optimized out>) at …/fsDeviceSwitch.h:****
    #16 0x**************** in DevFSIoctl (fileDesc=0x************, fhID=<optimized out>, cmd=IOCTLCMD_DEVFS_OPAQUE, dataIn=0x************, result=0x************) at …/devfs.c:****
    #17 0x**************** in FSSVec_Ioctl (desc=<optimized out>, fhID=<optimized out>, cmd=<optimized out>, dataIn=<optimized out>, result=<optimized out>) at …/fsSwitchVec.c:****
    #18 0x**************** in FSSObjectIoctlCommon (fhID=<optimized out>, file=0x************, cmd=<optimized out>, dataIn=0x************, result=0x************) at …/fsSwitch.c:****
    #19 0x**************** in FSS_IoctlByFH (fileHandleID=<optimized out>, cmd=<optimized out>, dataIn=0x************, result=0x************, ioFlags=<optimized out>) at …/fsSwitch.c:****
    #20 0x**************** in UserFile_PassthroughIoctl (vmfsObj=<optimized out>, cmd=<optimized out>, userData=<optimized out>, result=<optimized out>) at …/userFile.c:****
    #21 0x**************** in UserVmfs_Ioctl (obj=<optimized out>, cmd=<optimized out>, userArg=<optimized out>, ioctlReturnCode=<optimized out>) at …/userVmfs.c:****
    #22 0x**************** in LinuxFileDesc_Ioctl (fd=<optimized out>, cmd=<optimized out>, userData=<optimized out>) at …/linuxFileDesc.c:****
    #23 0x**************** in User_LinuxSyscallHandler (fullFrame=0x************) at …/user.c:****
    #24 0x**************** in gate_entry () at …/gates64.S:****
    #25 0x**************** in ?? ()
  • The host recovers only after a reboot.
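
As a triage aid, the leading frames of the backtrace above can be compared against a collected backtrace. The following is a minimal sketch, not a supported tool: the helper names `extract_frames` and `matches_signature` are hypothetical, and treating the first four frames (`memscan`, `strlen`, `pfa_get_counts`, `pfioctl`) as the signature is an assumption based only on the trace shown in this article.

```python
import re

# Assumption: the top four frames of the backtrace in this article are
# treated as the signature of the issue. Other dumps may differ.
SIGNATURE_FRAMES = ("memscan", "strlen", "pfa_get_counts", "pfioctl")

# Matches gdb frame lines such as:
#   "#0  memscan (c=0, ...) at ..."
#   "#3  0x... in pfioctl (kif=...) at ..."
FRAME_RE = re.compile(r"\s*#(\d+)\s+(?:0x\S+\s+in\s+)?([A-Za-z_]\w*)\s*\(")


def extract_frames(backtrace_text):
    """Return the function names of the frames, in the order gdb printed them."""
    frames = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(2))
    return frames


def matches_signature(backtrace_text):
    """True if the first frames equal the assumed signature, in order."""
    frames = extract_frames(backtrace_text)
    return tuple(frames[: len(SIGNATURE_FRAMES)]) == SIGNATURE_FRAMES
```

A match only suggests that a dump resembles the one in this article. Confirming the cause still requires the core dump analysis described under Cause.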

Environment

  • VMware ESXi 8.0 U3

  • NSX versions prior to:

    • NSX 4.2.3.3

    • NSX 9.0.2

  • Distributed Firewall with L7 FQDN-based rules / context profiles

Cause

Analysis of the vmkernel core dump shows that the PSOD is caused by corruption of the FQDN domain_list data structure within the NSX datapath.

  • The PSOD occurs while processing PF attribute connections associated with FQDN-based L7 DFW rules.

  • The attribute connection (ac) is marked with:

    ac_flags = 0x8 (PF_ATTR_CONN_FQDN)

    indicating that the flow is associated with an FQDN context.

  • During DNS resolution, the domain_list linked list within the fqdn_node is incorrectly updated.

  • Backtrace analysis shows:

    • Invalid tqh_first and tqh_last pointers in domain_list

    • Corrupted memory addresses passed to strlen() and memscan()

    • A kernel crash when that invalid memory is accessed

Example:

domain_list = {tqh_first = ************, tqh_last = ************}

In the affected core dump, these pointer values (masked here) are not valid list pointers, which indicates memory corruption.

The issue can be triggered by any read, update, or delete operation on FQDN entries.

These operations typically occur during DNS resolution or FQDN re-evaluation.

Resolution

Workaround

Use L4 Distributed Firewall rules instead of L7 FQDN-based rules.

Specifically:

  • Disable or remove L7 DFW rules that reference FQDN context profiles

  • Replace them with equivalent IP / CIDR / L4-based rules

This avoids the affected FQDN code path and prevents corruption of the domain_list structure.
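
When replacing an FQDN-based rule with IP / CIDR entries, the addresses behind a domain usually need to be condensed into a small set of CIDR blocks. The following is a minimal sketch of that step using Python's standard `ipaddress` module. The function name `summarize_addresses` is hypothetical, and the input is assumed to be a list of addresses already collected by other means (for example, from DNS lookups). It does not call any NSX API.

```python
import ipaddress


def summarize_addresses(addresses):
    """Collapse individual IP addresses into the smallest covering CIDR list.

    `addresses` is a list of IPv4/IPv6 address strings. IPv4 and IPv6 are
    collapsed separately because collapse_addresses() rejects mixed versions.
    """
    networks = [ipaddress.ip_network(addr) for addr in addresses]
    v4 = [net for net in networks if net.version == 4]
    v6 = [net for net in networks if net.version == 6]
    collapsed = list(ipaddress.collapse_addresses(v4))
    collapsed += list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in collapsed]


# Example with documentation-range addresses (RFC 5737), not real servers.
blocks = summarize_addresses(
    ["192.0.2.0", "192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.10"]
)
```

Unlike FQDN rules, a static address list does not follow DNS changes, so the list has to be refreshed whenever the addresses behind the domain change.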

Resolution

A permanent fix has been implemented by Engineering and is available in the following NSX releases:

  • NSX 4.2.3.3

  • NSX 9.0.2

Upgrading to one of these versions resolves the issue and allows safe use of L7 FQDN-based DFW rules.
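
To check an inventory of NSX version strings against the fixed releases listed above, a comparison along the following lines can be used. This is a minimal sketch under stated assumptions: the function names are hypothetical, only the first four dot-separated components are compared (any trailing build number is ignored), and versions with a major number of 9 or higher are compared against 9.0.2 while all others are compared against 4.2.3.3.

```python
# Fixed releases taken from this article.
FIXED_4X = (4, 2, 3, 3)
FIXED_9X = (9, 0, 2, 0)


def parse_version(text):
    """Parse the first four dot-separated components, padding with zeros."""
    parts = [int(p) for p in text.strip().split(".")[:4]]
    parts += [0] * (4 - len(parts))
    return tuple(parts)


def is_fixed(version_text):
    """True if the version is at or above the fixed release for its line.

    Assumption: major >= 9 is compared with 9.0.2, everything else with
    4.2.3.3. The article does not describe any other release lines.
    """
    version = parse_version(version_text)
    threshold = FIXED_9X if version[0] >= 9 else FIXED_4X
    return version >= threshold
```

For example, `is_fixed("4.2.3.2")` and `is_fixed("9.0.1")` evaluate to False, while `is_fixed("4.2.3.3")` and `is_fixed("9.0.2")` evaluate to True.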