ESXi hosts prepared with NSX-V 6.4.5 or NSX-V 6.4.6 VIBs may experience a PSOD.
The PSOD shows the following stack trace:
#0 DLM_free (msp=0x431a455dcca0, mem=mem@entry=0x431a458cbd10, allowTrim=allowTrim@entry=1 '\001') at bora/vmkernel/main/dlmalloc.c:4924
#1 0x0000418012343ffa in Heap_Free (heap=0x431a455dc000, mem=<optimized out>, mem@entry=0x431a458cbd10) at bora/vmkernel/main/heap.c:4314
#2 0x000041801222db25 in vmk_HeapFree (heap=<optimized out>, mem=mem@entry=0x431a458cbd10) at bora/vmkernel/core/vmkapi_heap.c:250
#3 0x000041801393ca61 in __VDL2_Free (heapID=<optimized out>, data=data@entry=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2.c:152
#4 0x0000418013950caf in VDL2_CPTaskFree (task=task@entry=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_ctlplane.c:164
#5 0x0000418013949415 in VDL2CPWorldProcessTask (task=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_cpworld.c:283
#6 VDL2CPWorldFunc (data=data@entry=0x0) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_cpworld.c:335
#7 0x0000418012308adf in vmkWorldFunc (data=<optimized out>) at bora/vmkernel/main/vmkapi_world.c:528
#8 0x00004180124c91f5 in CpuSched_StartWorld (destWorld=<optimized out>, previous=<optimized out>) at bora/vmkernel/sched/cpusched.c:10792
#9 0x0000000000000000 in ?? ()
The following entries in the ESXi host's vmkernel.log indicate that BFD was enabled on the host:
2019-10-17T00:34:01.996Z cpu75:68603 opID=6616b81b)vxlan: VDL2PortsetPropSet:1036: Updating BFD VTEP config to : enable
2019-10-17T00:34:01.996Z cpu75:68603 opID=6616b81b)BFD: BFD_CreateNewSession ENTER: localIP: a.b.c.d , remoteIP: w.x.y.z , probeInterval (in milli seconds): 12000
2019-10-17T00:34:01.996Z cpu75:68603 opID=6616b81b)WARNING: BFD: Inserted new session: Discriminator 1471713223, localIP: a.b.c.d remoteIP: w.x.y.z
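The log signatures above can be searched for programmatically. The following is a minimal sketch, assuming the message formats match the sample lines shown here; the function name `summarize_bfd_log` is hypothetical. It reports whether a BFD enable event was logged and collects the discriminators of newly inserted sessions. Because it only counts insertions and ignores sessions that were later removed, the result is a rough upper bound, not the live session count.

```python
import re

# Patterns derived from the sample vmkernel.log lines above (assumed format).
BFD_ENABLE = re.compile(r"VDL2PortsetPropSet:\d+: Updating BFD VTEP config to : enable")
BFD_SESSION = re.compile(r"BFD: Inserted new session: Discriminator (\d+)")


def summarize_bfd_log(lines):
    """Return (enable_event_seen, set_of_session_discriminators) for the given log lines."""
    enabled = False
    discriminators = set()
    for line in lines:
        if BFD_ENABLE.search(line):
            enabled = True
        match = BFD_SESSION.search(line)
        if match:
            discriminators.add(int(match.group(1)))
    return enabled, discriminators
```

For example, passing the lines of a copy of vmkernel.log, read with `open(path).readlines()`, returns the enable flag and the set of distinct discriminators seen in that file.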
Affected products
vRNI 4.2 and above
NSX 6.4.5 and above
Cause
Virtual Infrastructure Latency in NSX uses the BFD protocol to compute end-to-end latency metrics. The PSOD occurs when the NSX kernel module responds to a detailed BFD tunnel query from the control plane agent with the state of all BFD sessions maintained by the ESXi kernel.
The current release of the BFD module can handle information for up to 975 BFD tunnels. When the number of BFD tunnels exceeds 975, the response can overflow its buffer and corrupt the vmkernel heap metadata. The ESXi vmkernel heap management subsystem detects this corruption and triggers the PSOD.
Currently, there is no resolution. VMware is aware of this issue and is working towards a resolution.
Workaround 1
In vRealize Network Insight version 4.2.0 and above, go to Settings > Accounts and Data Sources. Edit the NSX Manager data source, deselect the "Virtual Infrastructure Latency" option to disable it, then click Submit to confirm the change.
Workaround 2
Use the NSX Manager API to disable the BFD configuration.
Use the GET API to determine the current BFD status:
GET /api/2.0/vdn/bfd/configuration/global
<bfdGlobalConfiguration>
    <enabled>true</enabled>
    <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
    <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
</bfdGlobalConfiguration>
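When checking the status from a script, the GET response body can be parsed with the Python standard library. This is a minimal sketch, assuming the response has exactly the shape shown above; the function name `parse_bfd_config` is hypothetical and is not part of any NSX SDK.

```python
import xml.etree.ElementTree as ET


def parse_bfd_config(xml_text):
    """Parse a <bfdGlobalConfiguration> document into a plain dict."""
    root = ET.fromstring(xml_text)
    if root.tag != "bfdGlobalConfiguration":
        raise ValueError(f"unexpected root element: {root.tag}")

    def required(name):
        # Fail loudly if an expected child element is absent or empty.
        value = root.findtext(name)
        if value is None or not value.strip():
            raise ValueError(f"missing element: {name}")
        return value.strip()

    return {
        "enabled": required("enabled").lower() == "true",
        "pollingIntervalSecondsForHost": int(required("pollingIntervalSecondsForHost")),
        "bfdIntervalMillSecondsForHost": int(required("bfdIntervalMillSecondsForHost")),
    }
```

If the returned `enabled` value is true, BFD is still active and the PUT call described in this workaround applies.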
Use the PUT API to set the BFD enabled flag to false:
PUT /api/2.0/vdn/bfd/configuration/global
<bfdGlobalConfiguration>
    <enabled>false</enabled>
    <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
    <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
</bfdGlobalConfiguration>
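The PUT call can also be scripted. The sketch below derives the request body from the GET response by changing only the enabled element, then prepares the request with the standard library. It assumes the NSX Manager API is reached over HTTPS with HTTP basic authentication and an application/xml content type; the helper names and the host name in the usage note are hypothetical.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET

BFD_PATH = "/api/2.0/vdn/bfd/configuration/global"


def build_disable_body(current_xml):
    """Return the GET response body with <enabled> set to false and all other fields unchanged."""
    root = ET.fromstring(current_xml)
    enabled = root.find("enabled")
    if enabled is None:
        raise ValueError("missing element: enabled")
    enabled.text = "false"
    return ET.tostring(root, encoding="unicode")


def build_put_request(manager_host, username, password, body):
    """Prepare (but do not send) the PUT request. Basic authentication is an assumption."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"https://{manager_host}{BFD_PATH}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/xml",
        },
    )
```

To send the request, pass the result of `build_put_request("nsxmgr.example.com", user, password, body)` to `urllib.request.urlopen`, then repeat the GET call to confirm that the enabled element now reads false.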