An ESXi host can experience a PSOD when traffic hits an NSX DFW rule whose context profile has FQDN attributes and the DNS server returns a CNAME record in its response.
A PSOD can also occur during the vMotion of a VM that is covered by an NSX DFW rule whose context profile has FQDN attributes, when the DNS server returns a CNAME record in its response.
Stack trace observed during vMotion:
<DATE>T<TIME>Z cpu6:2248766)@BlueScreen: #PF Exception 14 in world 2248766:NetWorld-VM- IP 0x420010e4a31e addr 0x12
PTEs:0x175fa0027;0x1e571c007;0x0;
<DATE>T<TIME>Z cpu6:2248766)Code start: 0x42000f40xxxx VMK uptime: 1:01:58:02.637
<DATE>T<TIME>Z cpu6:2248766)0x453951a9xxxx:[0x420010e4xxxx]pf_fqdn_uuid_tree_RB_NEXT@ com.vmware.vsip#1.0.7.0.21376387+0xe stack: 0x453951a999b8
<DATE>T<TIME>Z cpu6:2248766)base fs=0x0 gs=0x420041800000 Kgs=0x0
<DATE>T<TIME>Z cpu1:2101580)Failed to backup ConfigStore.
<DATE>T<TIME>Z cpu13:2097556)Jumpstart plugin petronas-wipe-partitions activation failed.
<DATE>T<TIME>Z cpu6:2248766)CPU model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, FMS: 06/4f/1, uCodeRev: b000040
Stack trace observed without vMotion:
Screen: Spin count exceeded - possible deadlock
<DATE>T<TIME>Z cpu0:66983194)Code start: 0x420030800000 VMK uptime: 41:03:22:04.411
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99ad70:[0x420030910c0d]PanicvPanicInt@vmkernel#nover+0x1f9 stack: 0x10
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99ae20:[0x420030911274]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453a5e99ae80
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99ae80:[0x4200308240e4]Lock_CheckSpinCount@vmkernel#nover+0x269 stack: 0x420040000000
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99aed0:[0x420030916500]MCSLockSpin@vmkernel#nover+0x71 stack: 0x4323d820dd18
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99af00:[0x4200309166d4]MCSLockRWContended@vmkernel#nover+0x1c1 stack: 0x0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99af50:[0x420030916e59]MCS_DoAcqReadLockWithRA@vmkernel#nover+0x82 stack: 0x453a5e99b228
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99af60:[0x420030835041]vmk_SpinlockReadLock@vmkernel#nover+0x16 stack: 0x800000002
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99af70:[0x420032246001]pf_test@ com.vmware.vsip#1.0.7.0.20682517+0x34d2 stack: 0x45bcc2aa83ba
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99b190:[0x4200322cc22f]PFFilterPacket@ com.vmware.vsip#1.0.7.0.20682517+0x50c stack: 0x0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99b4a0:[0x4200321ec6ff]VSIPDVFProcessPacketsInt@ com.vmware.vsip#1.0.7.0.20682517+0x4c8 stack: 0x0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bb70:[0x42003150f0e0]DVFilterInputOutputIOChainCB@ com.vmware.vmkapi#v2_10_0_0+0x89 stack: 0x43064204e108
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bbb0:[0x420030a53593]IOChain_Resume@vmkernel#nover+0x258 stack: 0x430600000001
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bc50:[0x420030a972be]Port_InputResume@vmkernel#nover+0x93 stack: 0x4306d4a06e00
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bca0:[0x420030a9b537]PortClient_InputCommitted@vmkernel#nover+0x34 stack: 0x4306d4a068c0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bcf0:[0x420030a4c18d]E1000DevAsyncTx@vmkernel#nover+0x53e stack: 0x4306f0e03e00
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bf50:[0x420030a818e1]NetWorldPerVMCB@vmkernel#nover+0x19e stack: 0x430113e9b750
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99bfe0:[0x420030c14c52]CpuSched_StartWorld@vmkernel#nover+0x7b stack: 0x0
<DATE>T<TIME>Z cpu0:66983194)0x453a5e99c000:[0x4200308d408f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
Another stack trace observed without vMotion:
<DATE>T<TIME>Z cpu62:2113087)@BlueScreen: #PF Exception 14 in world 2113087:NetWorld-VM- IP 0x42002466957f addr 0x1a
PTEs:0x0;
<DATE>T<TIME>Z cpu62:2113087)Code start: 0x420022c00000 VMK uptime: 0:02:54:14.349
<DATE>T<TIME>Z cpu62:2113087)0x453a5a018b68:[0x42002466957f]<fqdn>+0xf stack: 0x453a5a0199b8
<DATE>T<TIME>Z cpu62:2113087)base fs=0x0 gs=0x42004f800000 Kgs=0x0
<DATE>T<TIME>Z cpu62:2113087)CPU model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz, FMS: 06/55/7, uCodeRev: 5003302
<DATE>T<TIME>Z cpu62:2113087)PRODUCTNAME:Amazon EC2 i3en.metal-2tb, VENDORNAME:Amazon EC2, SERIAL_NUMBER:i-0aba838bce54f68b7, SERVER_UUID:<UUID>, VERSION:, SKU:, FAMILY:
The issue occurs when an NSX DFW context profile is configured with FQDN attributes and the DNS server returns a CNAME record in its response. When traffic hits a rule that uses this profile, or when a VM associated with the rule is migrated with vMotion, the host experiences memory corruption in the DFW, which leads to a PSOD.
Here is a sample configuration for an L7 context profile with an FQDN attribute.
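As a rough sketch, a context profile of this kind can be expressed as an NSX Policy API style payload. The helper name, display name, and domain value below are hypothetical. The attribute key DOMAIN_NAME, the datatype STRING, and the resource type PolicyContextProfile are assumed from the NSX Policy API context-profile schema and should be verified against the API reference for the NSX version in use.

```python
import json


def build_fqdn_context_profile(display_name, fqdns):
    """Return a dict shaped like an NSX Policy context profile with FQDN attributes.

    Assumption: field names follow the NSX Policy API context-profile schema.
    """
    if not fqdns:
        raise ValueError("at least one FQDN is required")
    return {
        "resource_type": "PolicyContextProfile",
        "display_name": display_name,
        "attributes": [
            # One attribute entry holding all FQDN values for the profile.
            {"key": "DOMAIN_NAME", "datatype": "STRING", "value": list(fqdns)}
        ],
    }


# Hypothetical example values, not taken from an affected environment.
profile = build_fqdn_context_profile("fqdn-profile-example", ["*.example.com"])
print(json.dumps(profile, indent=2))
```

A DFW rule that references a profile like this one is the kind of rule affected by the issue.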
NSX Advanced Firewall Activation has been temporarily disabled for VMC versions 1.20v1, 1.20v2, and 1.20v3. This issue has been resolved in VMC version 1.20v4.
Workaround:
Disable any DFW rule that uses a context profile with FQDN attributes.
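Where rules are managed through automation rather than the UI, the workaround might be scripted along these lines. This is only a sketch: the function name and the IDs are hypothetical, the URL layout and the `disabled` field are assumed from the NSX Policy API rule schema, and whether a partial PATCH is accepted should be confirmed for your version. The code only builds the request and does not send anything.

```python
def build_disable_rule_request(policy_id, rule_id, domain="default"):
    """Build (method, path, body) for disabling one DFW rule.

    Assumption: the path and the 'disabled' flag follow the NSX Policy API
    security-policy rule schema. Nothing is sent over the network here.
    """
    path = (
        f"/policy/api/v1/infra/domains/{domain}"
        f"/security-policies/{policy_id}/rules/{rule_id}"
    )
    return "PATCH", path, {"disabled": True}


# Hypothetical IDs for illustration only.
method, path, body = build_disable_rule_request("policy-1", "rule-1")
print(method, path, body)
```

The resulting tuple can be passed to whichever HTTP client and authentication method the environment already uses.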
Impact/Risks:
The ESXi host encounters a PSOD.
Impacted versions:
VMC: 1.20v1, 1.20v2, and 1.20v3
OnPrem – NSX-T Data Center versions 4.0.0, 4.0.1, 4.0.1.1, and 4.1.0
Fixed Versions:
VMC: 1.20v4
OnPrem – NSX-T Data Center versions 4.1.1, 4.1.2, 4.1.2.1, 4.1.2.3, 4.1.2.4, and 4.1.2.5
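For a quick inventory check, the version lists in this article can be encoded in a small lookup. The sketch below is illustrative only: the product labels "VMC" and "NSX" and the function name are hypothetical, and any version not named in this article is reported as "unknown" rather than guessed.

```python
# Version lists copied from this article; anything else is "unknown".
IMPACTED = {
    "VMC": {"1.20v1", "1.20v2", "1.20v3"},
    "NSX": {"4.0.0", "4.0.1", "4.0.1.1", "4.1.0"},
}
FIXED = {
    "VMC": {"1.20v4"},
    "NSX": {"4.1.1", "4.1.2", "4.1.2.1", "4.1.2.3", "4.1.2.4", "4.1.2.5"},
}


def classify(product, version):
    """Return 'impacted', 'fixed', or 'unknown' for a product/version pair."""
    if version in IMPACTED.get(product, set()):
        return "impacted"
    if version in FIXED.get(product, set()):
        return "fixed"
    return "unknown"


for product, version in [("NSX", "4.1.0"), ("NSX", "4.1.2.3"), ("VMC", "1.20v4")]:
    print(product, version, classify(product, version))
```

Exact string matching is used deliberately, so that versions outside the published lists are never inferred to be impacted or fixed.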