This article provides information to troubleshoot the issue and describes how to work around it.
RX packet drops are seen on NICs using the bnxtnet driver.
Entries like the following appear in the ESXi kernel logs:
2021-02-26T14:03:24.801Z cpu91:2097484)WARNING: bnxtnet: alloc_rx_buffers:2094: [vmnic2 : 0x45033c788000] Failed to allocate all, init'ed rx ring 2 with 2822/4092 pages only
2021-02-26T14:03:24.820Z cpu91:2097484)WARNING: bnxtnet: alloc_rx_buffers:2094: [vmnic1 : 0x45033c7b6000] Failed to allocate all, init'ed rx ring 7 with 22/3069 pages only
The packet page pool memory is exhausted.
Below is the command to check the usage information of the packet page pool (netPktPagePool). If the consumed or consumedPeak value has already reached, or is close to, the max value, the issue can be observed.
[root@esxhost:~] memstats -r group-stats -s gid:name:max:consumed:consumedPeak -u mb | grep netPktPagePool
gid   name             max    consumed   consumedPeak
----  ---------------  -----  ---------  ------------
163   netPktPagePool   1260   1260       1260
When the driver is operating in Enhanced Datapath / Enhanced Network Stack (ENS) mode, there is a higher probability of occurrence due to higher NIC RX ring usage. There is also a higher probability of occurrence on ESXi 6.7 versions earlier than 6.7 Patch 6 and ESXi 7.0 versions earlier than 7.0 Update 1, due to the very small default NetPagePool size limit in these versions.
The probability also increases when the MTU is set to 4000 or higher, or when hardware LRO is enabled.
The bnxtnet driver uses a special pool of memory, known as NetPagePool, to receive LRO packets or packets larger than 4 KB. When many NIC RX rings are in use to process these types of packets, the required memory may exceed the default size limit of NetPagePool. Depending on when NetPagePool exhaustion happens, it can cause a failure to initialize the NIC device in ESXi, a complete loss of RX traffic on the NIC, or RX packet drops.
There are two workarounds available.
The max value of the packet page pool is determined by two parameters: netPagePoolLimitCap and netPagePoolLimitPerGB. The effective limit is the smaller of the two page counts calculated separately from these two parameters.
The max value of the packet page pool (in bytes) = MIN(SystemMemoryNumGB * netPagePoolLimitPerGB, netPagePoolLimitCap) * 4096, where both parameters are page counts and 4096 is the page size in bytes.
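For example, assuming the default values shown below (netPagePoolLimitPerGB = 5120, netPagePoolLimitCap = 1048576) on a hypothetical host with 63 GB of physical memory (the size that would produce the max value shown above), the limit works out to:

MIN(63 * 5120, 1048576) * 4096 = MIN(322560, 1048576) * 4096
                               = 322560 pages * 4 KB
                               = 1260 MB

This matches the 1260 MB max reported by memstats above, so in that example netPagePoolLimitPerGB is the limiting parameter.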
So, first determine which parameter is limiting the current max value by checking the current max value of netPktPagePool together with netPagePoolLimitCap and netPagePoolLimitPerGB, then adjust netPagePoolLimitPerGB or netPagePoolLimitCap based on the actual case (see the sketch after the command output below).
The commands to check the above values are as follows:
[root@esxhost:~] memstats -r group-stats -s gid:name:max:consumed:consumedPeak -u mb | grep netPktPagePool
gid   name             max    consumed   consumedPeak
----  ---------------  -----  ---------  ------------
163   netPktPagePool   1260   1260       1260
[root@esxhost:~] esxcli system settings kernel list |grep netPagePoolLimitCap
Name                 Type    Configured  Runtime  Default  Description
-------------------  ------  ----------  -------  -------  -----------
netPagePoolLimitCap  uint32  1048576     1048576  1048576  Maximum number of pages period for the packet page pool.
[root@esxhost:~] esxcli system settings kernel list |grep netPagePoolLimitPerGB
Name                   Type    Configured  Runtime  Default  Description
---------------------  ------  ----------  -------  -------  -----------
netPagePoolLimitPerGB  uint32  5120        5120     5120     Maximum number of pages for the packet page pool per gigabyte.
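To avoid comparing these values by hand, the following minimal shell sketch (run in the ESXi shell) computes both candidate limits and reports which parameter currently limits the pool. It assumes the esxcli output layout shown above (runtime value in the fourth column of the data row) and a whole number of gigabytes of system memory; adapt it as needed:

memBytes=$(esxcli hardware memory get | awk '/Physical Memory/ {print $3}')    # physical memory in bytes
memGB=$((memBytes / 1073741824))                                               # truncated to whole GB
perGB=$(esxcli system settings kernel list -o netPagePoolLimitPerGB | awk 'NR==3 {print $4}')
cap=$(esxcli system settings kernel list -o netPagePoolLimitCap | awk 'NR==3 {print $4}')
perGBpages=$((memGB * perGB))                                                  # pages allowed by the per-GB limit
echo "Limit from netPagePoolLimitPerGB: $((perGBpages * 4 / 1024)) MB"
echo "Limit from netPagePoolLimitCap:   $((cap * 4 / 1024)) MB"
if [ "$perGBpages" -lt "$cap" ]; then
  echo "netPagePoolLimitPerGB is the limiting parameter"
else
  echo "netPagePoolLimitCap is the limiting parameter"
fi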
The commands to adjust netPagePoolLimitPerGB and netPagePoolLimitCap are as below:
esxcli system settings kernel set -s netPagePoolLimitPerGB -v <value>
esxcli system settings kernel set -s netPagePoolLimitCap -v <value>
Note: After adjusting netPagePoolLimitPerGB and netPagePoolLimitCap, the ESXi host must be rebooted for the changes to take effect.
Usually, it is recommended to allow the max value of netPktPagePool to reach 4 GB if system memory is sufficient. For netPagePoolLimitCap, the corresponding recommended value is 1048576 (1048576 pages * 4 KB = 4 GB).
(Note: netPagePoolLimitCap is already 1048576 by default since ESXi 7.0 Update 1.)
For netPagePoolLimitPerGB, since it is related to the total size of system memory, the appropriate value depends on the specific case: to allow a given target pool size, netPagePoolLimitPerGB must be at least the target page count divided by the system memory size in GB.
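As an illustration only, on a hypothetical host with 64 GB of memory where the target is the recommended 4 GB pool, netPagePoolLimitPerGB must be at least 1048576 / 64 = 16384. The values below are example figures for that assumed host, not universal recommendations:

esxcli system settings kernel set -s netPagePoolLimitPerGB -v 16384    # example value assuming a 64 GB host
esxcli system settings kernel set -s netPagePoolLimitCap -v 1048576    # recommended cap (4 GB)

Reboot the host afterwards for the new limits to take effect.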
Related information: NIC down when adding to vSwitch
https://knowledge.broadcom.com/external/article/318658
Impact/Risks:
Additional memory is consumed by the larger packet page pool.
A reboot of the ESXi host is required for the change to take effect.