Can you help me understand CPU%

Products

SYSVIEW Performance Management NXBRIDGE - SYSVIEW/ENDEVOR

Issue/Introduction

What is the effect of an MVS threshold on CPU% with a resource of ALL on a system with a zIIP processor?
If the CPs are running at 90% but the zIIP is running at 10%, will CA Sysview trigger a Warning or Problem event if the threshold is set at 90%?
Or does the zIIP being at 10% reduce the "ALL" percentage?
Why would I use CPU% versus CECCP% or CECCPU%?

Environment

Release:
Component: SYSVW

Resolution

The metric CPU% ALL is an average of the usage of general purpose CPs, zIIPs, and zAAPs (aka IFA).
If you have 3 CPs each pegged at 100% and 1 zIIP at 0% usage, then CPU% ALL will be the average of those 4 engines.
More specifically, it will be (100% + 100% + 100% + 0%) / 4 = 75%. So CPU% ALL would be 75% in this case.
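
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the simple average described here, not SYSVIEW code; the function name `cpu_pct_all` is made up for the example.

```python
def cpu_pct_all(engine_busy):
    """Return the simple mean of per-engine busy percentages.

    engine_busy: list of busy percentages, one per engine
    (general purpose CPs, zIIPs, and zAAPs together).
    """
    if not engine_busy:
        raise ValueError("at least one engine is required")
    return sum(engine_busy) / len(engine_busy)


# Example from the text: 3 CPs at 100% and 1 zIIP at 0%.
cps = [100.0, 100.0, 100.0]
ziips = [0.0]
print(cpu_pct_all(cps + ziips))  # 75.0
```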

To answer the question in the example, if all CPs are at 90% but the zIIP is at 10%, the CPU% ALL threshold will not trigger as the lower zIIP percentage will bring down the overall average to something less than 90%, depending on how many CPs and zIIPs you have.
If you are more concerned with general purpose CP usage, we would recommend setting a threshold on CPU% CP so that only CP usage is evaluated.
You can also set a separate CPU% IIP threshold to monitor the zIIPs on their own.
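
To see how the number of engines changes the outcome, the following Python sketch evaluates the 90% / 10% example for several CP counts. It is an illustration only: the helper names are hypothetical, and the "greater than or equal to" comparison is an assumption made for the example rather than a statement about how SYSVIEW compares values to thresholds.

```python
def average(values):
    """Simple mean of a non-empty list of percentages."""
    return sum(values) / len(values)


def cpu_pct_by_resource(cp_busy, iip_busy):
    """Return CPU% for the ALL, CP, and IIP resources as simple averages."""
    return {
        "ALL": average(cp_busy + iip_busy),
        "CP": average(cp_busy),
        "IIP": average(iip_busy),
    }


THRESHOLD = 90.0  # threshold value from the question

for n_cps in (1, 3, 9):
    metrics = cpu_pct_by_resource([90.0] * n_cps, [10.0])
    for resource, value in metrics.items():
        # Assumption for this sketch: a value at or above the threshold triggers.
        state = "triggers" if value >= THRESHOLD else "does not trigger"
        print(f"{n_cps} CPs + 1 zIIP: CPU% {resource} = {value:.1f}% ({state})")
```

With 1, 3, and 9 CPs, CPU% ALL works out to 50%, 70%, and 82% respectively, so it stays below 90% in every case, while CPU% CP is 90% each time.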

With regards to CPU% vs CECCP% vs CECCPU%: CPU% is at the image level.
It is the average of how busy MVS sees its logical processors. As noted before, you can use ALL, CP, IIP, and IFA to monitor usage for specific types.
CECCP% is at the hardware level. It is the average of how busy the general purpose CPs are on the box.
CECCPU% is at the hardware level. It is the average of how busy the general purpose CPs, zIIPs, and zAAPs are on the box.
CPU% is the much more commonly used metric.
You would use it to see whether you are using too much CPU for too long, for example if you are concerned about the rolling 4-hour average. If it gets to 100%, you know critical work is likely being delayed and elongated. CECCP% and CECCPU% show you how effectively you are using the entire box. They are more of a capacity planning aid, but people usually use SMF to figure out those types of problems.
You can set a threshold on them if you have a lot of work that dynamically moves around from LPAR to LPAR. If the number ever gets to 100%, your entire box is being consumed, and at that point you have some major problems on your hands. Either you are under capacity or you have some workload on multiple systems on the box dominating the entire set of hardware.
It is possible that no single MVS image is at 100% for CPU% CP while CECCP% is at 100%, depending on how you apply processor weights. In that case, you could use CECCP% as a tool to alert you that LPAR/processor weights may not be set up effectively.
There are all kinds of obscure use cases, but classically, CPU% has proven to be easier to work with and more understandable than CECCPU% and CECCP%.
If an LPAR has no specialty engines, then setting the threshold for CPU% ALL and CPU% CP will be EXACTLY the same, as you have hinted at.
If all 6 CPs are each running at 100%, then the average for them will be 100% for CPU% ALL and 100% for CPU% CP.
The data you see on the CPU command in SYSVIEW is the exact same data (averaged over a minute) used for SYSVIEW threshold processing.
With regards to parked CPUs, things can get a little more complicated. SYSVIEW does not consider the time a CPU is parked as available time.
For example, if you had an LPAR with 2 CPs, 1 was running at 100% for a minute, and the other was parked for the entire minute, SYSVIEW will exclude the parked time and only consider the time the CPs were not parked. So in this example, SYSVIEW would show CPU% CP as 100%.
Another example: if you had an LPAR with 2 CPs, where 1 was running at 100% for the full minute and the other was parked for 30 seconds and then ran at 50% for the remaining 30 seconds, SYSVIEW will count only the 30 seconds the second CP was not parked. The calculation would look like this, given that 100% of 60 seconds = 60 seconds of active time and 50% of 30 seconds = 15 seconds of active time:
Active time = 60 seconds + 15 seconds = 75 seconds
Available time = 60 seconds + 30 seconds = 90 seconds
Active / Available = 75 / 90 = 83.3%
So in the case above, CPU% CP would be 83.3%. It is certainly possible for CPU% CP to get to 100% even if a CP is parked, as you can hopefully see from the examples above.
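
The two parked-CP examples can be reproduced with a short Python sketch. It simply implements the active / available arithmetic described above; the function name and the `(parked_seconds, busy_pct)` input shape are hypothetical and are not part of SYSVIEW.

```python
def cpu_pct_excluding_parked(engines, interval_seconds=60):
    """Compute CPU% while excluding parked time from available time.

    engines: list of (parked_seconds, busy_pct_while_unparked) tuples,
    one per engine, describing a single interval.
    """
    active = 0.0
    available = 0.0
    for parked_seconds, busy_pct in engines:
        unparked = interval_seconds - parked_seconds
        available += unparked                     # only unparked time counts
        active += unparked * busy_pct / 100.0     # busy share of unparked time
    if available == 0:
        return 0.0  # every engine was parked for the whole interval
    return 100.0 * active / available


# Example 1: one CP at 100%, the other parked for the whole minute.
print(cpu_pct_excluding_parked([(0, 100.0), (60, 0.0)]))

# Example 2: one CP at 100%, the other parked 30 seconds, then 50% busy.
print(round(cpu_pct_excluding_parked([(0, 100.0), (30, 50.0)]), 1))
```

The first call yields 100% and the second yields 83.3%, matching the figures in the text.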
Within SYSVIEW, there is no control over this behavior. It is done this way deliberately, because it tells us how much of the processing power actually given to the operating system was used, rather than how much could theoretically have been provided in the best-case scenario. There are pros and cons to both ways of looking at it.
So then, how can you avoid CPU% alerts if a CP is parked? You can try setting the DURATION on the threshold to a higher value. We think the default is 2 minutes; you could change it to 3, 4, or even 5 minutes to try to avoid periods of time when a CP is parked but all other CPs are very busy. It would be very unusual to see CPU% averaging close to 100% for long periods of time while HiperDispatch allows CPs to be parked.
Probably a less favorable option: you could write a REXX rule in OPS/MVS to perform some additional analysis when the CPU% alert is triggered in OPS. You could use the SYSVIEW REXX API to interrogate the CPU command looking for parked CPs and, if there are some, ignore the alert with OPS.
With regards to looking at each engine, this is typically a very specific use case.
If you know that CP#6 should be parked most of the time, you could set a threshold just for it. If that CP ever gets to 100%, then you *MIGHT* have a problem.
Note that if CP#6 gets unparked and a task starts looping on it, it is possible CP#6 would be using 100%, but the other CPs might be using very little. MVS tries to keep the same dispatchable unit of work on the same engine to avoid caching problems.
Setting thresholds for each CP engine, while it can be done, is typically not done for this reason. In addition, you would have to update the threshold definitions whenever you change your hardware configuration.