This KB explains why the dataplane service fails to start and how to resolve the issue.
Symptoms:
After the ring buffer size is increased to 4096 on a bare metal (BM) Edge, the dataplane service fails to start.
VMware NSX-T Data Center
VMware NSX-T Data Center 3.x
VMware NSX-T Data Center 4.x
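This happens because heap memory on socket 0 is exhausted: a larger ring buffer consumes more hugepage memory. In the heap statistics below, the lines marked >>> show that almost none of the heap is free.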
$ cat ./edge/memory-malloc-heap
[
{
"Alloc_count": 37373,
"Alloc_size": 34357822400,
"Free_count": 1439,
"Free_size": 1915968, >>>
"Greatest_free_size": 15232,
"Heap id": 0,
"Heap name": "socket_0",
"Heap_size": 34359738368 >>>
},
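To put the numbers above in perspective, the following minimal sketch converts the socket_0 values into more readable units. The values are copied from the output shown; the variable names and the choice of binary units (GiB, MiB) are illustrative assumptions, not part of any NSX tooling.

```python
# Values copied from the socket_0 heap entry shown above.
heap = {
    "Alloc_size": 34357822400,
    "Free_size": 1915968,
    "Greatest_free_size": 15232,
    "Heap_size": 34359738368,
}

GIB = 1024 ** 3
MIB = 1024 ** 2

heap_gib = heap["Heap_size"] / GIB                      # total heap on socket 0
free_mib = heap["Free_size"] / MIB                      # total free memory
free_pct = 100 * heap["Free_size"] / heap["Heap_size"]  # free share of the heap

print(f"heap size: {heap_gib:.0f} GiB")
print(f"free: {free_mib:.2f} MiB ({free_pct:.4f}% of heap)")
print(f"largest free block: {heap['Greatest_free_size']} bytes")
```

The heap is 32 GiB, of which less than 2 MiB (well under 0.01%) is free, and the largest contiguous free block is only about 15 KB. Any allocation larger than that block fails even though some free memory remains.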
Starting with NSX 3.2.3.2 and 4.1.1, bare metal Edge (BME) supports 128 GB of hugepage memory, compared with 64 GB in earlier versions, so this issue does not occur with a 4K rx/tx ring buffer on those versions.
For other versions, refer to the Workaround section.
Workaround:
Monitor whether heap memory is sufficient, and decrease the ring buffer size to either 2048 or 1024.
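One way to monitor heap memory is to parse the memory-malloc-heap file and flag any heap whose free share is low. The sketch below is illustrative only: it assumes the file is valid JSON with the field names shown in the Cause output, and the function name `find_low_heaps` and the 5% default threshold are hypothetical choices, not values documented by VMware.

```python
import json


def find_low_heaps(path, min_free_pct=5.0):
    """Return (heap name, free %) for each heap below min_free_pct.

    Assumes the file is a JSON array of objects with the keys
    "Heap name", "Heap_size" and "Free_size". The 5% default is an
    arbitrary illustrative threshold.
    """
    with open(path) as f:
        heaps = json.load(f)

    low = []
    for heap in heaps:
        size = heap["Heap_size"]
        if size == 0:
            continue  # skip heaps that have no memory assigned
        free_pct = 100 * heap["Free_size"] / size
        if free_pct < min_free_pct:
            low.append((heap["Heap name"], free_pct))
    return low
```

For example, `find_low_heaps("./edge/memory-malloc-heap")` run against a file containing the socket_0 entry shown above would report that heap, since well under 1% of it is free.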
Impact/Risks:
If rte_heap_memory is exhausted, the Edge enters and then exits maintenance mode (MM), and systemd restarts all Edge services. This restart is an attempt to mitigate the exhaustion. The impact of rte_heap_memory exhaustion depends on how much memory is still available and on the configuration of the Edge.
When a few percent of the memory is still available, most operations continue to work. Datapath packet forwarding does not use the rte_heap, so it continues to work. However, configuration changes and state synchronization may use the heap and may start to fail for services such as firewall or load balancer (LB).