What does the dmdebug --queues output mean in Smarts?

Products

VMware

Environment

VMware Smart Assurance - SMARTS

Resolution

What does the dmdebug --queues output mean in Smarts?

A common debugging tool is dmdebug --queues which prints information about all server queues in the system.

dmctl -s <server name> exec dmdebug --queues --output=<file name>

file name will be in <BASEDIR>/smarts/local/logs/

Before Smarts 6.0, this command printed for each queue:

The name
The number of workers
The current size
The approximate maximum size ever attained. This information was only available for server queues; it always printed as 0 for timed queues.

Smarts 6.0 and above prints a great deal of additional information. Here's an example:

SubscriberFE                                                                     [0 worker(s)]
Current size 0, max 369; processed entries 21843
Size                0-0.00               0-0.00                0-0.00               0-0.00
Flow      21839+0.06      21803+0.05      21446+0.05      17895+0.06
Late     0.0000+0.00     0.0000+0.00     0.0000+0.00    0.0000+0.00

The first line gives you the name of the queue and the number of workers. Note the Subscriber Front Ends like this one never have any workers since they are not used as normal server queues. The second line gives you the current size, and an *exact* maximum size. This maximum size is now reported correctly for timed queues. Finally, there is a new value, the total number of entries "processed". To be exact, this is the total number of entries that have ever been pulled off the queue.

The next three rows provide averages. Each column reports values averaged over a different time interval: The last 3 minutes, 30 minutes, 300 minutes (5 hours), and 3000 minutes (50 hours, or about 2 days). Each row reports on a different value: Size is the queue size (i.e., it's an average of what "Current size" reports for this moment), Flow is the number of queue entries processed (i.e., it's an average of what "Current processed entries" reports), and Late is the lateness: The delay between when an entry in a timer queue was scheduled to run and when it actually ran. (The Late row is always all 0's for a server queue. It may be changed in future releases to report how long elements are staying in a server queue.)

Each individual entry has two fields, separated by a sign (+ or -). The first field is the average value of the variable being reported on over the appropriate period. This particular queue has processed many entries - almost 22,000 - and once even held 369 entries; but that was apparently an isolated peak some time in the past, since its *average* size over all the reported intervals is 0: Most of the time, the queue is empty. On the other hand, the Flow value over the last 3 minutes is very close to the current value. If we look at this value averaged over longer and longer periods, we see that the longer the period, the smaller the value. This is what we would expect if the rate at which elements are "flowing through" this queue is roughly constant. (Always keep in mind that Size grows and shrinks, while Flow can only grow. You can't directly compare them.)

The second field is the average rate of change in the first value *per second*. In the example above, the Size is shrinking slowly - less than 0.01 entries/second. On the other hand, the Flow is increasing at a steady rate - between 3.0 and 3.6 entries per minute are flowing through this queue. (Again, because Flow can only increase, the second field is always non-negative. Note, BTW, that the rates are pretty much the same across all time periods - consistent with the analysis in the previous paragraph that entries are flowing through at a slow, steady rate.)

The above definition can lead to a couple questions:

Shouldn't the average flow rate times the length of the period give the actual value?
How can we have 21446+0.05 for the 300 minute value and 17895+0.06 for the 3000 minute value?

Flow is *cumulative*. The expectation is false! From these numbers, in the last 300 minutes, *approximately* 21446-17895=3551 entries were processed. *If* the server has been up exactly 50 hours, and the actual rate were constant, we'd expect about 10 times as large an increase over the last 300 minutes as over the last 3000. However, (a) this server has been up for well over 50 hours; (b) the 3000-minute average *includes* the entries from the 300-minute average. So the comparison isn't valid. (Even so, our expectation is only off by a factor of about 2.)