Incorrect lcore allocation leads to lower dataplane throughput on CNF mapped to NUMA1

Article ID: 345711

Products

VMware Telco Cloud Platform Essentials

Issue/Introduction

- When lcores are allocated to an NSX EDP switch, if the number of allocated lcores exceeds the number of physical queues available on the physical NIC, the configuration can result in an environment that does not guarantee NUMA alignment for all workloads deployed on the server.

As an example, the following EDP switch is configured with 18 lcores:

Example: ENS switch list:

name            swID maxPorts numActivePorts numPorts mtu   numLcores lcoreIDs
------------------------------------------------------------------------------
DvsPortset-3    1    128      8              8        9000  18        4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
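
This switch-level summary can typically be gathered directly on the ESXi host with the ENS switch listing from nsxdp-cli (the exact sub-command may vary slightly between NSX releases):

# nsxdp-cli ens switch list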


But the number of available queues for the PNIC is 8:

ENS port list for Switch DvsPortset-3 
 

portID      ensPID TxQ RxQ hwMAC             numMACs  type         Queue Placement(tx|rx)
------------------------------------------------------------------------------
23152xxxx  0      8   8   00:00:00:00:00:00 0        UPLINK       4 5 6 7 8 9 10 11 |5 4 5 6 7 8 9 10 
23152xxxx  1      8   8   00:00:00:00:00:00 0        UPLINK       4 5 6 7 8 9 10 11 |5 4 5 6 7 8 9 10 
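
This per-port view, including the TxQ and RxQ counts, is the same output collected in Step 2 of the Resolution below:

# nsxdp-cli ens port list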


- This resulted in an lcore allocation where only lcores from NUMA0 were assigned to uplink data processing.

Note how the lcores associated with the uplinks (4 – 11) are all on NUMA0:
 

ENS NUMA affinity
Lcore ID  Switch        Affinity
--------  ------------  --------
       0  DvsPortset-2         0
       1  DvsPortset-2         0
       2  DvsPortset-2         1
       3  DvsPortset-2         1
       4  DvsPortset-3         0
       5  DvsPortset-3         0
       6  DvsPortset-3         0
       7  DvsPortset-3         0
       8  DvsPortset-3         0
       9  DvsPortset-3         0
      10  DvsPortset-3         0
      11  DvsPortset-3         0
      12  DvsPortset-3         0
      13  DvsPortset-3         1
      14  DvsPortset-3         1
      15  DvsPortset-3         1
      16  DvsPortset-3         1
      17  DvsPortset-3         1
      18  DvsPortset-3         1
      19  DvsPortset-3         1
      20  DvsPortset-3         1
      21  DvsPortset-3         1


With that configuration, a data plane intensive application running on NUMA1 has to cross NUMA boundaries to send traffic on the physical network, resulting in lower throughput for that application when compared to the same application running on NUMA0.
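
As an additional check, the NUMA node that owns the physical uplink can usually be confirmed from the host PCI inventory: locate the entry whose VMkernel Name matches the uplink (vmnic0 is only an example here) and read its NUMA Node field. Exact field names may differ slightly between ESXi releases:

# esxcli hardware pci list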

For more details on the recommended configuration to guarantee NUMA alignment, refer to the NUMA alignment on multi-socket systems section in the latest Telco Cloud Platform (TCP) Performance Tuning Guide.

Resolution

Make sure you validate the following:

Step 1: Determine the maximum number of lcores supported by the host:

- SSH to the ESXi host and run the following command:
 

# esxcli network ens maxLcores get

26


Step 2: Determine the maximum number of physical NIC driver queues supported on your host:

- SSH to the ESXi host and run the following command:
 

# nsxdp-cli ens port list
portID      ensPID TxQ RxQ hwMAC             numMACs  type         Queue Placement(tx|rx)
------------------------------------------------------------------------------
22817xxx  0      8  8  0c:42:a1:98:88:08 0        UPLINK       0 1 2 3 4 5 6 7 

Note: Make a note of the TxQ and RxQ values (8 in this example).
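
To identify which driver module backs the uplink (useful if the queue count needs to be revisited with the NIC vendor in Step 3), the per-NIC details can be queried; vmnic0 below is only an example and should be replaced with the uplink attached to the EDP switch. The driver name reported in the output is the module name:

# esxcli network nic get -n vmnic0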


Step 3: Align the lcore count on the ENS switch with the driver queue count, which is 8 in this example, instead of the 18 lcores currently configured.

NOTE: If the driver module supports more than 8 queues, the lcore count should be updated to match the maximum queue count exposed by the driver module parameters. Involve the driver vendor before making these changes.
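
If a higher queue count is being considered, the current driver module parameters can be reviewed before engaging the vendor. This is only a sketch; i40en is an example module name and must be replaced with the driver reported for your uplink:

# esxcli system module parameters list -m i40en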

Conclusion: The driver reports 8 TxQ and 8 RxQ, but the ENS switch list shows 18 lcores allocated. This overcommitted / misaligned configuration is the reason the application sees the throughput issue.
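
Once the lcore count on the EDP switch has been reduced to match the queue count, the alignment can be re-verified with the same commands used above; the numLcores value reported for the switch should now match the TxQ/RxQ count (8 in this example):

# esxcli network ens maxLcores get
# nsxdp-cli ens switch list
# nsxdp-cli ens port list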