Understanding Simple Network Management Protocol Timeouts

Products

VMware Cloud Foundation
VMware vSphere ESXi

Issue/Introduction

Symptoms:

Customer applications report the following:

Agent does not reply
Customer reports the same symptoms as in the KB article Snmpwalk command times out when more than 10 LUNs are attached to an ESXi host (2059590)
Third party management software reports timeout

Environment

VMware EVO:RAIL
VMware EVO:RAIL 1.x

Cause

This issue occurs when a management application sends a request to a Simple Network Management Protocol (SNMP) Agent and waits a configured amount of time, usually measured in seconds, for a response. If a response from an SNMP Agent, such as those found in ESXi or vCenter Appliance, is not received before the timeout expires, the application either retries the request or reports a failure.

The causes for a timeout may be one of these:

The network is unstable; routing/bridging can take 2-30 seconds to converge.
The SNMP Agent fetches the requested state from vmkernel/drivers at the time of the request (cache miss), which takes time.
The rate of requests to the agent exceeds its ability to reply, filling the IP stack UDP ingress queue.
Invalid credentials are given to the agent, in which case the agent simply drops the request.
The snmpd agent has crashed or is stopped.

Resolution

To resolve this issue, in most cases this sequence of steps can be used:

On the VMware system in question, check that the agent is running and there are no core dumps in /var/core.
Ping the agent from the management station to verify connectivity.
Verify that the snmpd credentials used (community string or v3 user) work by using a MIB browser application. Linux and Apple OS X provide the snmpwalk command-line tool. For Microsoft operating systems, suggested tools are those from ireasoning.com or ipMonitor from SolarWinds.com.
If a MIB walk from these tools over the agent results in timeouts, find the last object polled. This object is the issue.
Determine the timeout that works for that object: adjust the timeout up by factors of 2 (2, 4, 8, 16, 32 seconds) until there is no timeout. Note that the agent caches the result internally, so wait 30 seconds between requests.
Determine whether there is congestion by looking at Recv Q in the output of localcli network ip connection list on ESX, or netstat -nap on Linux.
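
The timeout search described in the steps above can be sketched in Python. This is a minimal sketch, not a VMware tool: the names build_snmpget_cmd, snmpget_probe, and find_working_timeout are hypothetical, SNMP v2c is assumed, and the only external dependency assumed is the net-snmp snmpget command with its -v, -c, -t (timeout in seconds), and -r (retries) options. Retries are set to 0 so that each probe exercises exactly one timeout value.

```python
import subprocess
import time


def build_snmpget_cmd(host, community, oid, timeout_s):
    """Build a net-snmp snmpget command line for one probe (SNMP v2c assumed)."""
    return ["snmpget", "-v2c", "-c", community,
            "-t", str(timeout_s), "-r", "0", host, oid]


def snmpget_probe(host, community, oid):
    """Return a probe(timeout_s) callable that runs snmpget once."""
    def probe(timeout_s):
        cmd = build_snmpget_cmd(host, community, oid, timeout_s)
        result = subprocess.run(cmd, capture_output=True, text=True)
        # snmpget exits non-zero when the request fails or times out.
        return result.returncode == 0
    return probe


def find_working_timeout(probe, timeouts=(2, 4, 8, 16, 32),
                         cache_wait_s=30, sleep=time.sleep):
    """Try each timeout in turn; return the first that succeeds, else None."""
    for i, timeout_s in enumerate(timeouts):
        if i > 0:
            # The agent caches results internally, so pause between requests.
            sleep(cache_wait_s)
        if probe(timeout_s):
            return timeout_s
    return None
```

A call such as find_working_timeout(snmpget_probe(host, community, oid)) would then report the smallest of the listed timeouts that lets the problem object be fetched, where host, community, and oid are placeholders for the values in use at the site.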

Timeout explained

Here is the general formula for setting a timeout in Management applications:

Mgmt_App_Timeout = (Transit time there and back between Manager app and SNMP agent + Agent Processing time) * 2
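
As a worked illustration of this formula, the following Python sketch computes the timeout from the two measured quantities. The function name and the sample values (0.5 seconds round trip, 1.0 second of agent processing) are assumptions chosen for illustration only.

```python
def mgmt_app_timeout(round_trip_s, agent_processing_s):
    """Timeout = (round-trip transit time + agent processing time) * 2."""
    return (round_trip_s + agent_processing_s) * 2


# Example: 0.5 s there and back, plus 1.0 s of agent processing.
print(mgmt_app_timeout(0.5, 1.0))  # 3.0
```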

Almost all management applications use a default timeout in the range of 2-5 seconds with a retry of 2 to 3 times.

The maximum wait time for a client application is the product of the timeout and the number of retries when there is no congestion. For example, with a timeout of 2.5 seconds and two retries, the maximum wait time is 5 seconds, as measured from receipt of the first request by the agent to the time it replies to that request.
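
The example above (2.5 seconds and two retries giving 5 seconds) multiplies the timeout by the number of retries. A small Python sketch of that arithmetic, with hypothetical function names, can also be used to check whether a given configuration would cover an agent response time that has been measured separately:

```python
def max_wait_time(timeout_s, retries):
    """Maximum wait without congestion: timeout multiplied by retries."""
    return timeout_s * retries


def covers_agent_response(agent_response_s, timeout_s, retries):
    """True if the observed agent response time fits within the maximum wait."""
    return agent_response_s <= max_wait_time(timeout_s, retries)


print(max_wait_time(2.5, 2))               # 5.0
print(covers_agent_response(6.0, 2.5, 2))  # False
print(covers_agent_response(6.0, 2.5, 3))  # True
```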

For protocols that use UDP, such as SNMP, it is up to the application to decide the value of the timeout. A timeout does not mean that no response was sent. It means that the response did not arrive within the arbitrary limit set in the management application configuration. Furthermore, the SNMP Manager applications used by customers today do not dynamically adjust the timeout based on observed response time. Identifying the value the application uses for its timeout and then checking the time taken by a VMware Agent to return real-time data is therefore often the core of the issue faced by customers.

Retries explained

The retries value specifies the number of times a request is resent after a timeout has occurred.

When the timeout configured for an agent is too short, the application's repeated requests pile up in the IP stack receive queue, because VMware SNMP Agents process one request at a time. This happens unless the application changes the xid in the request.

A response to the first request appears to the management application as the reply to the last request sent. This is why the maximum wait time is the product of the retry and timeout values.
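
The interaction between a short timeout and an agent that handles one request at a time can be illustrated with a simplified timeline model in Python. This is a sketch under stated assumptions, not a description of the agent's internals: the function name is hypothetical, the retransmissions are assumed to reuse the same xid, network transit time is ignored, and the agent is assumed to spend a fixed time answering the first request.

```python
def retransmit_timeline(agent_processing_s, timeout_s, retries):
    """Model one request against an agent that handles one request at a time.

    Returns (send_times, first_reply_time, queued_duplicates).
    """
    send_times = [0.0]  # the original request is sent at t = 0
    for attempt in range(1, retries + 1):
        resend_at = attempt * timeout_s
        if resend_at >= agent_processing_s:
            break  # the first reply arrives before this retry would be sent
        send_times.append(resend_at)
    first_reply_time = agent_processing_s
    # Every retransmission sent before the reply waits in the receive queue.
    queued_duplicates = len(send_times) - 1
    return send_times, first_reply_time, queued_duplicates


print(retransmit_timeline(5.0, 2.0, 3))  # ([0.0, 2.0, 4.0], 5.0, 2)
```

In this example the reply produced for the request sent at 0 seconds arrives at 5 seconds, one second after the retransmission at 4 seconds, so the application sees it as the answer to its most recent request while two duplicates remain queued for the agent.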

Congestion explained

Congestion occurs with all protocols, but those that use UDP as the transport have no mechanism in the IP stack to handle it. Where HTTP over TCP slows down its sending rate when responses do not come quickly, SNMP over UDP does not. This can be a good thing when there is packet loss between a sender and a receiver. SNMP over UDP is more reliable than TCP-based management protocols with as little as 5% packet loss between sender and receiver. TCP connections time out or never complete because TCP responds to delay by sending packets more slowly, whereas SNMP continues to send at the same or an increased rate so that more packets get through. Since SNMP also has none of the three-way handshake that TCP incurs, every packet has an equal chance of being answered.

However, in a normal network the agent appears not to respond when the packet arrival rate exceeds the agent's response rate. The IP/UDP stack input queue fills, newly arriving packets cannot be added to it, and they are dropped before the agent can read enough packets out of the queue. To address this, the queue size would need to be adjustable (it is not as of the ESXi 5.5 release), and/or the amount of CPU time/resource pool may be increased to improve the response rate.
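
The queue overflow described above can be illustrated with a simple discrete-time model in Python. This is a sketch for intuition only: the function name, the fixed per-tick rates, and the queue capacity are all hypothetical and do not reflect actual ESXi queue sizes.

```python
def simulate_udp_queue(arrivals_per_tick, served_per_tick, queue_capacity, ticks):
    """Count answered, dropped, and still-queued requests for a bounded queue."""
    queued = 0
    dropped = 0
    answered = 0
    for _ in range(ticks):
        # New requests arrive first; whatever does not fit is dropped.
        accepted = min(arrivals_per_tick, queue_capacity - queued)
        dropped += arrivals_per_tick - accepted
        queued += accepted
        # The agent then reads as many requests as it can process this tick.
        served = min(served_per_tick, queued)
        queued -= served
        answered += served
    return {"answered": answered, "dropped": dropped, "queued": queued}


# 5 requests arrive per tick, the agent answers 3, and the queue holds 10.
print(simulate_udp_queue(5, 3, 10, 10))
# {'answered': 30, 'dropped': 13, 'queued': 7}
```

With the arrival rate above the service rate, the queue fills after a few ticks and every later tick loses requests, which the management application observes as timeouts even though the agent is answering continuously.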

Additional Information

Some applications, like the Orion NPM application from vendor SolarWinds, do not have a per-system adjustable timeout. Have the customer adjust the number of retries if possible. This issue is a compatibility problem for the customer, not a failure of the SNMP Agent. There is no IETF standard for the response time/retry since every customer configuration may be different. Lastly, VMware does not specify a maximum response time from the agent, since the APIs it uses internally (public and private) also do not provide latency bounds.

http://www.net-snmp.org/wiki/index.php/Tutorials
http://www.webnms.com/cagent/help/technology_used/c_snmp_overview.html
Only v1/v2c, no v3: http://technet.microsoft.com/en-us/library/cc783142%28v=ws.10%29.aspx
http://www.brocade.com/downloads/documents/html_product_manuals/NOS_MIB/wwhelp/wwhimpl/common/html/wwhelp.htm#context=53-1002490-01&file=1_Intro.03.2.html
http://www.cisco.com/c/en/us/support/docs/ip/simple-network-management-protocol-snmp/7244-snmp-trap.html

Snmpwalk command times out when more than 10 LUNs are attached to an ESXi host