Dataplane is not able to start after increasing ring buffer size

Article ID: 314225

Products

VMware NSX

Issue/Introduction

  • After the ring buffer size is increased to 4096 on a bare-metal (BM) edge, the dataplane service fails to start.
  • When heap memory is exhausted, the edge triggers an enter and exit of maintenance mode, and systemd restarts all edge services.

Log entries

2024-02-05T00:22:00.699Z be06-eg611-krw2 NSX 1917610 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats59" level="ERROR" errorCode="EDG0400711"] rte malloc_heap is exhausted (100% is used)
2024-02-05T00:22:00.699Z be06-eg611-krw2 NSX 1917610 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats59" level="INFO"] trigger enter and exit maintenance mode
2024-02-05T00:22:00.814Z be06-eg611-krw2 124ed474ff88 1914309 - - 2024-02-05T00:22:00Z datapathd 1917610 stats tname="stats59" [ERROR] rte malloc_heap is exhausted (100% is used) errorCode="EDG0400711"
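
To find these events in a larger syslog file, the exhaustion message can be matched with a regular expression. The following is a minimal Python sketch, not an NSX tool. The function name find_heap_exhaustion is hypothetical, and the pattern is based only on the sample lines above, so other releases may format the message differently.

```python
import re

# Pattern taken from the sample log lines above; other releases may differ.
EXHAUSTED = re.compile(r"rte malloc_heap is exhausted \((\d+)% is used\)")


def find_heap_exhaustion(lines):
    """Return (timestamp, used_percent) for each line reporting heap exhaustion."""
    events = []
    for line in lines:
        match = EXHAUSTED.search(line)
        if match:
            # Assumption: the timestamp is the first whitespace-separated field.
            timestamp = line.split(maxsplit=1)[0]
            events.append((timestamp, int(match.group(1))))
    return events


# Two of the log lines shown above, copied verbatim.
sample = [
    '2024-02-05T00:22:00.699Z be06-eg611-krw2 NSX 1917610 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats59" level="ERROR" errorCode="EDG0400711"] rte malloc_heap is exhausted (100% is used)',
    '2024-02-05T00:22:00.699Z be06-eg611-krw2 NSX 1917610 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats59" level="INFO"] trigger enter and exit maintenance mode',
]

for timestamp, used in find_heap_exhaustion(sample):
    print(timestamp, f"{used}% used")
```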

Environment

VMware NSX-T Data Center 3.x
VMware NSX 4.x

Cause

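  • Socket 0 runs out of heap memory because the larger ring buffers consume more hugepage memory. In the output below, the lines marked >>> show that Free_size is only 1915968 against a Heap_size of 34359738368.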

    $ cat ./edge/memory-malloc-heap
    [
        {
            "Alloc_count": 37373,
            "Alloc_size": 34357822400,
            "Free_count": 1439,
            "Free_size": 1915968, >>>
            "Greatest_free_size": 15232,
            "Heap id": 0,
            "Heap name": "socket_0",
            "Heap_size": 34359738368 >>>
        },
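
The arithmetic behind that output can be checked with a short Python sketch. It uses only the socket_0 values shown above and assumes the size fields are in bytes, which the article does not state explicitly.

```python
# Values copied from the socket_0 entry in the output above.
# Assumption: all size fields are in bytes.
heap = {
    "Alloc_size": 34357822400,
    "Free_size": 1915968,
    "Greatest_free_size": 15232,
    "Heap_size": 34359738368,
}

GIB = 1024 ** 3
MIB = 1024 ** 2

# Allocated plus free memory accounts for the whole heap.
accounted = heap["Alloc_size"] + heap["Free_size"]

heap_gib = heap["Heap_size"] / GIB
free_mib = heap["Free_size"] / MIB
used_pct = 100 * heap["Alloc_size"] / heap["Heap_size"]

print(f"heap size: {heap_gib:.0f} GiB")
print(f"free:      {free_mib:.2f} MiB")
print(f"used:      {used_pct:.2f}%")
print(f"largest free block: {heap['Greatest_free_size']} bytes")
```

Under that assumption, the heap on socket 0 is 32 GiB with less than 2 MiB free, and the largest contiguous free block is 15232 bytes, so any larger allocation fails.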

Resolution

  • Starting with NSX 3.2.3.2 and 4.1.1, bare-metal edges (BME) support 128 GB of hugepage memory, compared with 64 GB in earlier versions. With these versions, the issue is not seen with a ring buffer size of 4096.


    Workaround:
    For other NSX versions, monitor heap memory utilization. If the issue is observed, decrease the ring buffer size to 2048 or 1024.
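
One way to monitor heap utilization is to flag any heap whose allocated share crosses a threshold. The following is a minimal Python sketch under stated assumptions: the memory-malloc-heap file is a JSON array of objects with the field names shown in the Cause section, the 90% threshold is an arbitrary illustration and not a VMware recommendation, and the helper names are hypothetical.

```python
import json


def heaps_over_threshold(heaps, threshold_pct=90.0):
    """Return (heap_name, used_pct) for heaps at or above threshold_pct."""
    flagged = []
    for heap in heaps:
        size = heap["Heap_size"]
        if size == 0:
            continue  # skip heaps with no memory assigned
        used_pct = 100.0 * heap["Alloc_size"] / size
        if used_pct >= threshold_pct:
            flagged.append((heap["Heap name"], round(used_pct, 2)))
    return flagged


def load_heaps(path):
    """Load heap statistics; assumes the file is a complete JSON array."""
    with open(path) as f:
        return json.load(f)


# socket_0 values are from the article; the socket_1 entry is hypothetical.
example = [
    {"Heap name": "socket_0", "Alloc_size": 34357822400, "Heap_size": 34359738368},
    {"Heap name": "socket_1", "Alloc_size": 0, "Heap_size": 0},
]

for name, used in heaps_over_threshold(example):
    print(f"{name}: {used}% of heap allocated")
```

With a real file, the same check would be `heaps_over_threshold(load_heaps("./edge/memory-malloc-heap"))`, where the path is the one shown in the Cause section.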

Additional Information

Impact/Risks:
If rte_heap_memory is exhausted, the edge triggers an enter and exit of maintenance mode (MM), and systemd restarts all edge services. This operation attempts to mitigate the impact of rte_heap_memory exhaustion, which depends on the amount of memory still available and on the configuration of the edge.


When a few percent of the memory is still available, most operations continue to work. Datapath packet forwarding does not use the rte_heap, so it continues to work. However, configuration changes and state synchronization may use the heap and may start to fail for services such as firewall or load balancer (LB).