SNMP service crashes and generates snmpd-zdump files

Products

VMware vSphere ESXi

Issue/Introduction

This article explains how to keep the SNMP service from crashing so that hosts can continue to be monitored by management systems.
Affected ESXi versions: 6.5 and 6.7.

Symptoms:
The SNMP service crashes and generates snmpd-zdump files continuously.
A memory leak in snmpd, triggered by the LLDP configuration, causes a crash and a new zdump on the ESXi host every few minutes.
memstats output also shows that snmpd has exhausted its allocated memory.

vobd.log shows that snmpd is crashing and generating core dumps:

2019-09-07T19:51:23.965Z: [UserWorldCorrelator] 2827682500644us: [esx.problem.application.core.dumped] An application (/bin/snmpd) running on ESXi host has crashed (2 time(s) so far). A core file may have been created at /var/core/snmpd-zdump.001.
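
To gauge how often the crash has occurred, the matching events can be counted. The following is a minimal sketch, assuming the log is available at /var/log/vobd.log on the host; the function name count_snmpd_core_dumps is hypothetical and used only for illustration.

```shell
# Count snmpd core-dump events recorded in a vobd.log file.
# Usage on a host (path is an assumption): count_snmpd_core_dumps /var/log/vobd.log
count_snmpd_core_dumps() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps the function's exit status clean.
    grep -c 'esx\.problem\.application\.core\.dumped.*/bin/snmpd' "$1" || true
}
```

A count that keeps growing between runs suggests the host is still affected.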

vmkernel.log shows entries like the following repeated frequently:

2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 471: Admission failure in path: snmpd/snmpd.19338545/uw.19338545
2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 478: UserWorld 'snmpd' with cmdline '/sbin/snmpd'
2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 478: UserWorld 'snmpd' with cmdline '/sbin/snmpd'
2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 489: uw.19338545 (127316410) extraMin/extraFromParent: 256/256, snmpd (805) childEmin/eMinLimit: 5630/5632
2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 471: Admission failure in path: snmpd/snmpd.19338545/uw.19338545
2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 471: Admission failure in path: snmpd/snmpd.19338545/uw.19338545
2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 478: UserWorld 'snmpd' with cmdline '/sbin/snmpd'
2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 478: UserWorld 'snmpd' with cmdline '/sbin/snmpd'
2019-09-18T05:25:41.137Z cpu74:19338545)MemSchedAdmit: 489: uw.19338545 (127316410) extraMin/extraFromParent: 33/33, snmpd (805) childEmin/eMinLimit: 5630/5632
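
The frequency of these admission failures can be summarized per hour. This is a rough sketch, assuming the log is at /var/log/vmkernel.log and that every line starts with an ISO timestamp as in the excerpt above; the function name is hypothetical.

```shell
# Print "<date>T<hour> <count>" for snmpd admission failures in a vmkernel.log file.
# Usage on a host (path is an assumption): snmpd_admission_failures_per_hour /var/log/vmkernel.log
snmpd_admission_failures_per_hour() {
    awk '
        /MemSchedAdmit/ && /Admission failure in path: snmpd\// {
            hour = substr($1, 1, 13)   # first 13 characters, e.g. 2019-09-18T05
            seen[hour]++
        }
        END { for (h in seen) print h, seen[h] }
    ' "$1" | sort
}
```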

Environment

VMware vSphere ESXi 6.7

Cause

snmpd runs out of its allocated resources, which means it either needs a bigger resource pool or has a memory leak. The allocation summary below shows about 13.7 MB held in 236,595 allocations:
Unsigned allocations have 236595 instances taking 0xd11588(13,702,536) bytes.
   Unsigned allocations of size 0x58 have 111625 instances taking 0x95e318(9,823,000) bytes.
   Unsigned allocations of size 0x18 have 99450 instances taking 0x246b70(2,386,800) bytes.
   Unsigned allocations of size 0x28 have 18801 instances taking 0xb79a8(752,040) bytes.
   Unsigned allocations of size 0x68 have 6645 instances taking 0xa8b88(691,080) bytes.
   Unsigned allocations of size 0x258 have 70 instances taking 0xa410(42,000) bytes.
   Unsigned allocations of size 0x268 have 2 instances taking 0x4d0(1,232) bytes.
   Unsigned allocations of size 0x688 have 1 instances taking 0x688(1,672) bytes.
   Unsigned allocations of size 0x1268 have 1 instances taking 0x1268(4,712) bytes.
236595 allocations use 0xd11588 (13,702,536) bytes.
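
The per-size lines can be cross-checked against the reported totals. The sketch below assumes the summary has been saved to a text file in exactly the format shown above; the function name sum_allocations is hypothetical.

```shell
# Sum instance counts and bytes (size * instances) from the per-size lines
# of an allocation summary saved in the file given as $1.
sum_allocations() {
    total_count=0
    total_bytes=0
    while read -r w1 w2 w3 w4 size w6 count rest; do
        # Per-size lines look like:
        # "Unsigned allocations of size <hex> have <count> instances ..."
        if [ "$w3" = "of" ] && [ "$w4" = "size" ]; then
            total_count=$((total_count + $count))
            # Shell arithmetic accepts the 0x-prefixed hex size directly.
            total_bytes=$((total_bytes + $size * $count))
        fi
    done < "$1"
    echo "$total_count $total_bytes"
}
```

For the summary above this yields 236595 instances and 13702536 bytes, matching the reported totals. The 0x58 and 0x18 allocations alone account for 12,209,800 bytes, roughly 89% of the total.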

Memory statistics can be captured with the following commands. Two examples follow: one from a lab host with normal SNMP memory usage, and one from a customer environment showing the problem.

#memstats -r group-stats -f -u mb -s name:parGid:min:max:consumed:rminpeak -u mb > /tmp/snmp.txt 

#cat /tmp/snmp.txt | grep -E "consumed|snmp" 
Selected columns : gid:name:parGid:min:max:rMinPeak:consumed
gid  name   parGid  min  max  rMinPeak  consumed
796  snmpd  15      0    0    0         0          (normal usage, from the lab)

#cat /tmp/snmp.txt | grep -E "consumed|snmp" 
Selected columns : gid:name:parGid:min:max:rMinPeak:consumed
gid  name   parGid  min  max  rMinPeak  consumed
805  snmpd  15      22   22   23        21         (memory exhausted by snmpd, from the customer environment with the issue)
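
The comparison between consumed and max can be automated. The following is a sketch only: it assumes the column order gid, name, parGid, min, max, rMinPeak, consumed shown above, and the 90% threshold is an arbitrary illustrative value, not a VMware recommendation. The function name check_snmpd_memory is hypothetical.

```shell
# Report whether snmpd's consumed memory is close to its max.
# $1: file holding memstats output (for example /tmp/snmp.txt)
# $2: optional threshold in percent (assumed default: 90)
check_snmpd_memory() {
    awk -v threshold="${2:-90}" '
        $2 == "snmpd" {
            status = "OK"
            # Column 5 is max, column 7 is consumed (both in MB).
            if ($5 > 0 && $7 * 100 >= $5 * threshold) status = "NEAR_LIMIT"
            print "snmpd gid=" $1 " consumed=" $7 " max=" $5 " status=" status
        }
    ' "$1"
}
```

With the customer values above (consumed 21 of max 22) the row is reported as NEAR_LIMIT, while the lab row with all zeros is reported as OK.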

Resolution

The fix is included in ESXi 6.5 U3 and ESXi 6.7 U3.

Workaround:
The following workaround is only a temporary fix. It should be performed only with an expert technician from VMware and EMC.

Run a cron job to restart the snmpd service daily.

Perform the following steps:

1. cd /var/spool/cron/crontabs
2. chmod +w root
3. vi root
4. Add the following line:
30 1 * * * localcli system snmp set -e 0 && localcli system snmp set -e 1
The above example runs the command ( localcli system snmp set -e 0 && localcli system snmp set -e 1 ) every day at 01:30 AM.
This cron job disables and then re-enables the SNMP service every day, which restarts snmpd. For more information about cron jobs on ESXi, see the KB article:
https://kb.vmware.com/s/article/1033346
You may want to adjust the schedule so the job runs at the hour when system load (CPU, memory, and storage) is typically lowest.

5. Restart cron with the following command (if a crond process is already running, stop it first as described in KB 1033346):
/usr/lib/vmware/busybox/bin/busybox crond
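
If the crontab edit in steps 3 and 4 is scripted rather than done in vi, it helps to avoid adding the same line twice. The following is a minimal sketch under that assumption; the function name add_cron_entry is hypothetical, and the target file must already be writable.

```shell
# Append a crontab entry to a file only if the exact line is not already there.
# $1: crontab file (for example /var/spool/cron/crontabs/root)
# $2: the full crontab line to add
add_cron_entry() {
    file=$1
    entry=$2
    # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string
    grep -qxF "$entry" "$file" 2>/dev/null || printf '%s\n' "$entry" >> "$file"
}

# Example call (not executed here):
# add_cron_entry /var/spool/cron/crontabs/root \
#     '30 1 * * * localcli system snmp set -e 0 && localcli system snmp set -e 1'
```

Running the function repeatedly leaves a single copy of the entry in the file.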

Another workaround:
The memory leak in snmpd is a result of the LLDP configuration. If LLDP is turned off, snmpd does not leak memory and therefore does not crash.

Additional Information

Impact/Risks:
None