[vRNI] [4.x] ESXi hosts may experience PSOD issue after enabling ‘Virtual Infrastructure Latency’ on NSX datasource in vRealize Network Insight.



Article ID: 314422



Products

VMware Aria Operations for Networks
VMware NSX

Issue/Introduction


ESXi hosts prepared with NSX-V 6.4.5 or NSX-V 6.4.6 VIBs may experience a PSOD (purple screen of death).





The following is the stack trace of the PSOD:

#0 DLM_free (msp=0x431a455dcca0, mem=mem@entry=0x431a458cbd10, allowTrim=allowTrim@entry=1 '\001') at bora/vmkernel/main/dlmalloc.c:4924
#1 0x0000418012343ffa in Heap_Free (heap=0x431a455dc000, mem=<optimized out>, mem@entry=0x431a458cbd10) at bora/vmkernel/main/heap.c:4314
#2 0x000041801222db25 in vmk_HeapFree (heap=<optimized out>, mem=mem@entry=0x431a458cbd10) at bora/vmkernel/core/vmkapi_heap.c:250
#3 0x000041801393ca61 in __VDL2_Free (heapID=<optimized out>, data=data@entry=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2.c:152
#4 0x0000418013950caf in VDL2_CPTaskFree (task=task@entry=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_ctlplane.c:164
#5 0x0000418013949415 in VDL2CPWorldProcessTask (task=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_cpworld.c:283
#6 VDL2CPWorldFunc (data=data@entry=0x0) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_cpworld.c:335
#7 0x0000418012308adf in vmkWorldFunc (data=<optimized out>) at bora/vmkernel/main/vmkapi_world.c:528
#8 0x00004180124c91f5 in CpuSched_StartWorld (destWorld=<optimized out>, previous=<optimized out>) at bora/vmkernel/sched/cpusched.c:10792
#9 0x0000000000000000 in ?? ()



The following entries in the ESXi host's vmkernel.log indicate that BFD was enabled on the host:

2019-10-17T00:34:01.996Z cpu75:68603 opID=6616b81b)vxlan: VDL2PortsetPropSet:1036: Updating BFD VTEP config to : enable
2019-10-17T00:34:01.996Z cpu75:68603 opID=6616b81b)BFD: BFD_CreateNewSession ENTER: localIP: a.b.c.d , remoteIP: w.x.y.z , probeInterval (in milli seconds): 12000
2019-10-17T00:34:01.996Z cpu75:68603 opID=6616b81b)WARNING: BFD: Inserted new session: Discriminator 1471713223, localIP: a.b.c.d remoteIP: w.x.y.z 



Affected products


vRNI 4.2 and above
NSX 6.4.5 and above



Cause

Virtual Infrastructure Latency in NSX uses the BFD protocol to compute end-to-end latency metrics. The PSOD occurs when the NSX kernel module responds to a detailed BFD tunnel query from the control plane agent with the state of every BFD session maintained by the ESXi kernel.
The current release of the BFD module can handle information for up to 975 BFD tunnels. When the number of BFD tunnels exceeds 975, a buffer overflow can occur and corrupt the vmkernel heap metadata. The ESXi vmkernel heap management subsystem detects this overflow and triggers the PSOD.
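To gauge roughly how close a host is to the 975-tunnel limit, the BFD session-creation entries in vmkernel.log can be counted. The following Python sketch is illustrative only: it assumes the log line format matches the sample entries shown above, and the names count_bfd_sessions and BFD_TUNNEL_LIMIT are hypothetical. It counts distinct local/remote pairs that were created and does not account for sessions that were later removed, so treat the result as an estimate rather than the live tunnel count.

```python
import re

# Limit taken from the Cause section of this article.
BFD_TUNNEL_LIMIT = 975

# Pattern assumed from the sample vmkernel.log entries in this article.
SESSION_RE = re.compile(
    r"BFD_CreateNewSession ENTER: localIP:\s*(\S+)\s*,\s*remoteIP:\s*(\S+)\s*,"
)


def count_bfd_sessions(lines):
    """Count distinct (localIP, remoteIP) pairs seen in session-creation entries."""
    pairs = set()
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            pairs.add((match.group(1), match.group(2)))
    return len(pairs)


# Example usage against a copy of the host log:
#   with open("vmkernel.log") as log:
#       total = count_bfd_sessions(log)
#   print(total, "sessions created; limit is", BFD_TUNNEL_LIMIT)
```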


Environment

VMware vRealize Network Insight 5.x
VMware NSX for vSphere 6.4.x
VMware vRealize Network Insight 4.x

Resolution

Currently, there is no resolution. VMware is aware of this issue and is working towards a resolution.


Workaround 1
In vRealize Network Insight version 4.2.0 and above, go to Settings > Accounts and Datasource. Edit the NSX Manager datasource, uncheck the “Virtual Infrastructure Latency” option to disable it, then click "Submit" to confirm the change.




Workaround 2

Use the NSX Manager API to disable the BFD configuration.

Use the GET API to determine the current BFD status:
GET /api/2.0/vdn/bfd/configuration/global

<bfdGlobalConfiguration>
      <enabled>true</enabled>
      <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
      <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
</bfdGlobalConfiguration>



Use the PUT API with enabled set to false to disable BFD:
PUT /api/2.0/vdn/bfd/configuration/global

<bfdGlobalConfiguration>
     <enabled>false</enabled> 
     <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
     <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
</bfdGlobalConfiguration>
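
The GET and PUT calls above can also be scripted. The following Python sketch is one possible approach, not a VMware-provided tool: the function names are hypothetical, and it assumes the NSX Manager accepts HTTP basic authentication and an application/xml request body. It reads the current configuration, flips only the enabled element, and sends the rest back unchanged so the interval values are preserved.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET

# Endpoint path taken from the GET/PUT calls in this article.
BFD_PATH = "/api/2.0/vdn/bfd/configuration/global"


def with_bfd_disabled(config_xml):
    """Return the bfdGlobalConfiguration XML with <enabled> set to false."""
    root = ET.fromstring(config_xml)
    enabled = root.find("enabled")
    if enabled is None:
        raise ValueError("no <enabled> element in BFD configuration")
    enabled.text = "false"
    return ET.tostring(root, encoding="unicode")


def nsx_request(host, user, password, method, body=None, context=None):
    """Send one request to the BFD global configuration endpoint.

    Assumption: basic authentication and an XML content type are accepted.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{host}{BFD_PATH}",
        data=body.encode() if body is not None else None,
        method=method,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/xml",
        },
    )
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode()


def disable_bfd(host, user, password, context=None):
    """GET the current configuration and PUT it back with BFD disabled."""
    current = nsx_request(host, user, password, "GET", context=context)
    state = (ET.fromstring(current).findtext("enabled") or "").strip().lower()
    if state == "true":
        nsx_request(host, user, password, "PUT",
                    with_bfd_disabled(current), context=context)
        return True   # a change was sent
    return False      # BFD was already disabled
```

Calling disable_bfd with the NSX Manager address and an administrator's credentials performs both steps. Pass an ssl.SSLContext as context if the manager's certificate is not trusted by the default store. Running the GET call again afterwards confirms that enabled now reads false.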