ESXi host running NSX-T 4.0.0, 4.0.1, 4.1.0 or VMC version 1.20v1/v2/v3 may experience a PSOD in the presence of a DFW L7 Context Profile with an FQDN attribute



Article ID: 325162


Updated On:

Products

VMware NSX Networking, VMware Cloud on AWS, VMware Cloud on Dell EMC

Issue/Introduction

This article helps identify whether an ESXi PSOD was caused by this known issue.

Symptoms:

A PSOD can occur when traffic hits an NSX DFW rule whose context profile has an FQDN attribute and the DNS server returns a CNAME record in its response.
A PSOD can occur during vMotion of a VM that is covered by an NSX DFW rule whose context profile has an FQDN attribute, after the DNS server has returned a CNAME record in its response.


Stack trace observed during vMotion:

<DATE>T<TIME>Z cpu6:2248766)@BlueScreen: #PF Exception 14 in world 2248766:NetWorld-VM- IP 0x420010e4a31e addr 0x12
PTEs:0x175fa0027;0x1e571c007;0x0;
<DATE>T<TIME>Z cpu6:2248766)Code start: 0x42000f40xxxx VMK uptime: 1:01:58:02.637
<DATE>T<TIME>Z cpu6:2248766)0x453951a9xxxx:[0x420010e4xxxx]pf_fqdn_uuid_tree_RB_NEXT@ com.vmware.vsip#1.0.7.0.21376387+0xe stack: 0x453951a999b8
<DATE>T<TIME>Z cpu6:2248766)base fs=0x0 gs=0x420041800000 Kgs=0x0
<DATE>T<TIME>Z cpu1:2101580)Failed to backup ConfigStore.
<DATE>T<TIME>Z cpu13:2097556)Jumpstart plugin petronas-wipe-partitions activation failed.
<DATE>T<TIME>Z cpu6:2248766)CPU model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, FMS: 06/4f/1, uCodeRev: b000040

Stack trace observed without vMotion:

Screen: Spin count exceeded - possible deadlock
<DATE>T<TIME>Z cpu0:66983194)Code start: 0x420030800000 VMK uptime: 41:03:22:04.411
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99ad70:[0x420030910c0d]PanicvPanicInt@vmkernel#nover+0x1f9 stack: 0x10
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99ae20:[0x420030911274]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453a5e99ae80
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99ae80:[0x4200308240e4]Lock_CheckSpinCount@vmkernel#nover+0x269 stack: 0x420040000000
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99aed0:[0x420030916500]MCSLockSpin@vmkernel#nover+0x71 stack: 0x4323d820dd18
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99af00:[0x4200309166d4]MCSLockRWContended@vmkernel#nover+0x1c1 stack: 0x0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99af50:[0x420030916e59]MCS_DoAcqReadLockWithRA@vmkernel#nover+0x82 stack: 0x453a5e99b228
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99af60:[0x420030835041]vmk_SpinlockReadLock@vmkernel#nover+0x16 stack: 0x800000002
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99af70:[0x420032246001]pf_test@ com.vmware.vsip#1.0.7.0.20682517+0x34d2 stack: 0x45bcc2aa83ba
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99b190:[0x4200322cc22f]PFFilterPacket@ com.vmware.vsip#1.0.7.0.20682517+0x50c stack: 0x0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99b4a0:[0x4200321ec6ff]VSIPDVFProcessPacketsInt@ com.vmware.vsip#1.0.7.0.20682517+0x4c8 stack: 0x0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bb70:[0x42003150f0e0]DVFilterInputOutputIOChainCB@ com.vmware.vmkapi#v2_10_0_0+0x89 stack: 0x43064204e108
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bbb0:[0x420030a53593]IOChain_Resume@vmkernel#nover+0x258 stack: 0x430600000001
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bc50:[0x420030a972be]Port_InputResume@vmkernel#nover+0x93 stack: 0x4306d4a06e00
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bca0:[0x420030a9b537]PortClient_InputCommitted@vmkernel#nover+0x34 stack: 0x4306d4a068c0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bcf0:[0x420030a4c18d]E1000DevAsyncTx@vmkernel#nover+0x53e stack: 0x4306f0e03e00
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bf50:[0x420030a818e1]NetWorldPerVMCB@vmkernel#nover+0x19e stack: 0x430113e9b750
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bfe0:[0x420030c14c52]CpuSched_StartWorld@vmkernel#nover+0x7b stack: 0x0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99c000:[0x4200308d408f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0

Another stack trace observed without vMotion:


<DATE>T<TIME>Z cpu62:2113087)@BlueScreen: #PF Exception 14 in world 2113087:NetWorld-VM- IP 0x42002466957f addr 0x1a
PTEs:0x0;
<DATE>T<TIME>Z cpu62:2113087)Code start: 0x420022c00000 VMK uptime: 0:02:54:14.349
<DATE>T<TIME>Z cpu62:2113087)0x453a5a018b68:[0x42002466957f]<fqdn>+0xf stack: 0x453a5a0199b8
<DATE>T<TIME>Z cpu62:2113087)base fs=0x0 gs=0x42004f800000 Kgs=0x0
<DATE>T<TIME>Z cpu62:2113087)CPU model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz, FMS: 06/55/7, uCodeRev: 5003302
<DATE>T<TIME>Z cpu62:2113087)PRODUCTNAME:Amazon EC2 i3en.metal-2tb, VENDORNAME:Amazon EC2, SERIAL_NUMBER:i-0aba838bce54f68b7, SERVER_UUID:<UUID>, VERSION:, SKU:, FAMILY:
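As a rough triage aid, the backtrace patterns above can be matched automatically. The following Python sketch is an assumption-based heuristic, not a VMware tool: the function name classify_psod and its three verdict levels are hypothetical, and the patterns are copied from the example traces in this article. A hit only raises suspicion; confirming the issue still requires checking for DFW rules whose context profile uses an FQDN attribute.

```python
import re

# Heuristic patterns copied from the example backtraces in this article.
# They are an illustrative triage aid, NOT an official VMware detection rule.
FQDN_TREE_FRAME = re.compile(r"pf_fqdn_uuid_tree_RB_NEXT@\s*com\.vmware\.vsip")
PF_TEST_FRAME = re.compile(r"\bpf_test@\s*com\.vmware\.vsip")
SPIN_DEADLOCK = re.compile(r"Spin count exceeded - possible deadlock")
NETWORLD_PF = re.compile(r"#PF Exception 14 in world \d+:NetWorld-VM-")


def classify_psod(log_text: str) -> str:
    """Return a coarse verdict for a PSOD backtrace (hypothetical helper)."""
    # Strongest signal: the page fault lands in the DFW FQDN tree walk.
    if FQDN_TREE_FRAME.search(log_text):
        return "strong match: fault in pf_fqdn_uuid_tree_RB_NEXT (vsip)"
    # Deadlock variant: spin count exceeded while pf_test (vsip) waits for a read lock.
    if SPIN_DEADLOCK.search(log_text) and PF_TEST_FRAME.search(log_text):
        return "likely match: spin count exceeded in vsip pf_test"
    # Weak signal: a page fault in a NetWorld-VM world with no decoded vsip frame.
    if NETWORLD_PF.search(log_text):
        return "weak match: #PF in NetWorld-VM world, confirm DFW FQDN rules"
    return "no match"
```

To use it, pass the backtrace portion of the vmkernel log or core dump as a string, for example classify_psod(open("backtrace.txt").read()), where the file name is hypothetical.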


Environment

VMware NSX 4.0.0.1

Cause

An NSX DFW context profile has an FQDN attribute configured, and the DNS server returns a CNAME record in its response. When traffic hits a rule that uses this profile, or when a VM associated with the rule is migrated with vMotion, memory corruption occurs in the DFW, which leads to a PSOD on the host.

The following is a sample configuration of an L7 context profile with an FQDN attribute:

[Image: sample L7 context profile configuration with an FQDN attribute]
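To find which rules are exposed, a sketch like the following can flag enabled DFW rules that reference a context profile with an FQDN attribute. It assumes the JSON shapes returned by the NSX Policy API (a context profile's "attributes" list, with the key DOMAIN_NAME holding FQDNs, and a rule's "profiles" list of context-profile paths); verify these field names against the API reference for your NSX version. The helper names and all paths and values in the sample data are hypothetical.

```python
# Assumed NSX Policy API field names (verify against your NSX version):
#   context profile: "path", "attributes": [{"key": ..., "value": [...]}]
#   DFW rule:        "path", "profiles": [<context profile paths>], "disabled"
FQDN_ATTRIBUTE_KEY = "DOMAIN_NAME"


def is_fqdn_profile(profile: dict) -> bool:
    """True if the context profile carries at least one FQDN (domain name) value."""
    return any(
        attr.get("key") == FQDN_ATTRIBUTE_KEY and attr.get("value")
        for attr in profile.get("attributes", [])
    )


def rules_at_risk(rules: list, profiles: list) -> list:
    """Paths of enabled rules that reference an FQDN context profile."""
    fqdn_paths = {p["path"] for p in profiles if is_fqdn_profile(p)}
    return [
        r["path"]
        for r in rules
        if not r.get("disabled", False) and fqdn_paths.intersection(r.get("profiles", []))
    ]


# Illustrative data only; every path and value below is hypothetical.
profiles = [
    {"path": "/infra/context-profiles/fqdn-profile",
     "attributes": [{"key": "DOMAIN_NAME", "value": ["*.example.com"], "datatype": "STRING"}]},
    {"path": "/infra/context-profiles/app-profile",
     "attributes": [{"key": "APP_ID", "value": ["HTTP"], "datatype": "STRING"}]},
]
rules = [
    {"path": "/infra/domains/default/security-policies/fqdn-policy/rules/allow-fqdn",
     "profiles": ["/infra/context-profiles/fqdn-profile"], "disabled": False},
    {"path": "/infra/domains/default/security-policies/fqdn-policy/rules/allow-http",
     "profiles": ["/infra/context-profiles/app-profile"], "disabled": False},
    {"path": "/infra/domains/default/security-policies/fqdn-policy/rules/old-fqdn",
     "profiles": ["/infra/context-profiles/fqdn-profile"], "disabled": True},
]
print(rules_at_risk(rules, profiles))
# prints ['/infra/domains/default/security-policies/fqdn-policy/rules/allow-fqdn']
```

Rules that are already disabled are skipped because they are not enforced and therefore do not match the trigger conditions described in the Cause section.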
 

 
 

Resolution

NSX Advanced Firewall activation has been temporarily disabled for VMC versions 1.20v1, 1.20v2, and 1.20v3. This issue is resolved in VMC version 1.20v4.

Workaround:
Disable the DFW rule(s) whose context profile has an FQDN attribute configured.
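If the workaround is applied through the API rather than the UI, the change amounts to setting the rule's disabled flag. The sketch below only builds the request and never sends it; it assumes the NSX Policy API rule endpoint (/policy/api/v1 followed by the rule path) and a rule object previously fetched with GET. The manager host name, rule path, and helper name are hypothetical, and authentication is deliberately left out.

```python
import json
import urllib.request


def build_disable_rule_request(manager: str, rule: dict) -> urllib.request.Request:
    """Build (but do not send) a Policy API PATCH that disables one DFW rule.

    Assumption: `rule` is the JSON previously returned by
    GET /policy/api/v1<rule path>; only the "disabled" flag is changed.
    """
    body = dict(rule)          # shallow copy so the caller's dict is untouched
    body["disabled"] = True    # the rule stays configured but is no longer enforced
    url = f"https://{manager}/policy/api/v1{rule['path']}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


# Hypothetical rule object and manager host name.
rule = {"path": "/infra/domains/default/security-policies/fqdn-policy/rules/allow-fqdn",
        "action": "ALLOW", "disabled": False}
req = build_disable_rule_request("nsx-manager.example.local", rule)
print(req.get_method(), req.full_url)
```

Sending it would require adding credentials (for example an Authorization header) and calling urllib.request.urlopen(req). The same approach, with disabled set back to false, re-enables the rule once the fix is in place.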

Additional Information

Impact/Risks:
The ESXi host experiences a PSOD.
 
Impacted versions:
VMC: 1.20v1, 1.20v2, and 1.20v3
On-premises: NSX-T Data Center versions 4.0.0, 4.0.1, 4.0.1.1, and 4.1.0
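For inventory scripts, the impacted-version list can be expressed as a simple lookup. This is a hypothetical helper that does exact string matching on the versions named in this article (including 4.0.0.1 from the Environment section); a version that is not listed is reported as not impacted here, which does not by itself prove it is unaffected, so check the release notes for anything not named above.

```python
# Versions named as impacted in this article; 4.0.0.1 is the version listed in
# the Environment section. Anything else must be checked against release notes.
IMPACTED_ONPREM = {"4.0.0", "4.0.0.1", "4.0.1", "4.0.1.1", "4.1.0"}
IMPACTED_VMC = {"1.20v1", "1.20v2", "1.20v3"}


def is_impacted(platform: str, version: str) -> bool:
    """Exact-match lookup against the versions named in this article (hypothetical helper)."""
    v = version.strip().lower()
    if platform == "vmc":
        return v in IMPACTED_VMC
    if platform == "onprem":
        return v in IMPACTED_ONPREM
    raise ValueError(f"unknown platform: {platform!r}")
```

For example, is_impacted("vmc", "1.20v4") returns False because 1.20v4 contains the fix, while is_impacted("onprem", "4.0.1.1") returns True.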