The Capacity team is reporting periods of high CPU usage in Netmaster. What is causing this?
Netmaster for TCP/IP Release: 12.2
This is one example of the symptoms. Other situations may show various other messages, or there may be no obvious symptoms in the log.
Symptoms in the logs correlated with the CPU graph.
1. The graph shows over 150 MIPS for all timeframes starting in early November, except for an abrupt drop in the evening of November 17th, after which usage remains low until early on November 20th.
2. The log shows pairs of 'burst' messages, indicating that Netmaster is protecting the region against excessive incoming events during the interval between each pair; a conceptual sketch of this behavior follows the example messages.
01.15.41.91 BG-SYS NMBGBSYS IPLO8120 IPLOG Processing changed from 'ACTIVE' to 'BURST-MODE' due to incoming records
greater than burst protection limit of .
01.22.10.56 BG-SYS NMBGBSYS IPLO8121 IPLOG Processing changed from 'BURST-MODE' to 'ACTIVE'
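For illustration only, the sketch below (Python) models the kind of rate-based protection these IPLO8120/IPLO8121 messages describe: when incoming records exceed a limit within an interval, logging is suspended, and it resumes once the rate drops. The limit, interval length, and state handling here are assumptions made for the example, not Netmaster internals or configuration values.

# Conceptual sketch only - not Netmaster code. The limit and interval are
# illustrative assumptions; interval_expired() would be driven by a timer.
BURST_LIMIT = 10_000      # assumed maximum records accepted per interval
INTERVAL_SECS = 60        # assumed length of the measurement interval

class IplogGate:
    def __init__(self):
        self.state = "ACTIVE"
        self.count = 0

    def record_arrived(self):
        self.count += 1
        if self.state == "ACTIVE" and self.count > BURST_LIMIT:
            self.state = "BURST-MODE"      # analogous to IPLO8120
        return self.state == "ACTIVE"      # record is logged only when ACTIVE

    def interval_expired(self):            # called every INTERVAL_SECS seconds
        if self.state == "BURST-MODE" and self.count <= BURST_LIMIT:
            self.state = "ACTIVE"          # analogous to IPLO8121
        self.count = 0

gate = IplogGate()
accepted = sum(gate.record_arrived() for _ in range(15_000))
print(gate.state, accepted)                # BURST-MODE 10000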
3. The IPLOG became BAD-LOCKED on November 17th. The IPLO8121 message below is also the last 'burst' message that occurs until the region is restarted on the morning of November 20th.
18.10.32.13 BG-SYS NMBGBSYS N87902 VSAM ERROR ON NDB FILE: IPLOG VFS REQUEST CODE: 16 R15: 16 FDBK: 1C
18.10.32.13 - - N87V91 NDB: IPLOG KEYSTATS/RIDSCAN STOPPED AT NDB SHUTDOWN (RESTART PENDING).
18.10.32.14 BG-SYS NMBGBSYS N16C63 ALLOC MDO OPERAND INVALID, MDO KEY=11
18.10.32.52 BG-SYS NMBGBSYS IPLO8121 IPLOG Processing changed from 'BURST-MODE' to 'ACTIVE'
18.10.37.14 - - N87150 NDB IPLOG HAS STOPPED.
4. When IPLOG housekeeping starts at 01:30 on the 18th, it successfully reorgs the IPLOG.
Note that the number of records processed is over 9 million. The log should never contain anywhere near that many records; at that size it becomes unusable.
01.30.25.30 BG-SYS NMBGBSYS IPLO8221 IPLOG reorg: copying records to INSX00P.NETMASTR.M2.TSYS.IPLOGSEQ
01.30.40.01 BG-SYS NMBGBSYS IPLO8223 IPLOG reorg: REPRO IFILE(IPLOG) ODS(INSX00P.NETMASTR.M2.TSYS.IPLOGSEQ)
01.30.40.01 BG-SYS NMBGBSYS IPLO8223 IPLOG reorg: IDC0005I NUMBER OF RECORDS PROCESSED WAS 9149333
01.30.40.01 BG-SYS NMBGBSYS IPLO8223 IPLOG reorg: IDC0001I FUNCTION COMPLETED, HIGHEST CONDITION CODE WAS 0
01.30.40.01 BG-SYS NMBGBSYS IPLO8223 IPLOG reorg: IDC0002I IDCAMS PROCESSING COMPLETE. MAXIMUM CONDITION CODE WAS 0
01.30.40.01 BG-SYS NMBGBSYS IPLO8222 IPLOG reorg: copying records from INSX00P.NETMASTR.M2.TSYS.IPLOGSEQ
01.30.52.41 BG-SYS NMBGBSYS IPLO8223 IPLOG reorg: REPRO IDS(INSX00P.NETMASTR.M2.TSYS.IPLOGSEQ) OFILE(IPLOG) REUSE
01.30.52.48 BG-SYS NMBGBSYS IPLO8223 IPLOG reorg: IDC0005I NUMBER OF RECORDS PROCESSED WAS 9149333
01.30.52.48 BG-SYS NMBGBSYS IPLO8223 IPLOG reorg: IDC0001I FUNCTION COMPLETED, HIGHEST CONDITION CODE WAS 0
5. After the region is restarted on November 20th, IPLOG housekeeping - specifically record deletion - is excessively slow.
Where the region started on November 13th deleted over a million records in an hour, the new region with the reorganized IPLOG takes until 22:35 to delete half that amount.
6. The IPLOG is so large that it appears sections of record IDs are not being cleared.
Of the 14 days' worth of logs that were sent, housekeeping did not delete any records on 6 of those days, even though it should delete everything older than 4 days.
On other days, the deleted records fluctuated between 50,000 and over a million records.
I would expect the amounts to be more consistent if it was working properly.
Housekeeping works by examining a portion of the log and clearing all expired records in that section, normally working its way sequentially through the entire log. That is not happening here: sections are being missed, while other sections that are already empty or still contain valid data are reviewed instead. As a result, the log continues to grow even though some areas are never cleared for reuse.
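The sketch below is a purely conceptual model of that behavior, not the product's algorithm; the section layout and record counts are invented for illustration, and the 4-day expiry is taken from the description above. It shows why a sweep that skips sections leaves expired records behind and lets the file keep growing.

# Conceptual model only: housekeeping visits sections of the log and deletes
# expired records in the sections it actually visits. Skipped sections keep
# their expired records, so the file cannot shrink.
from datetime import datetime, timedelta

EXPIRY = timedelta(days=4)

def housekeep(sections, visited, now):
    for idx in visited:
        sections[idx] = [ts for ts in sections[idx] if now - ts <= EXPIRY]
    return sum(len(s) for s in sections)     # records remaining in the log

now = datetime(2022, 11, 20)
old = now - timedelta(days=10)               # well past the expiry period
sections = [[old] * 1000 for _ in range(10)] # 10 sections of expired records

# A full sequential sweep clears everything ...
print(housekeep([list(s) for s in sections], range(10), now))  # 0 remaining
# ... but a sweep that misses sections leaves expired records behind.
print(housekeep([list(s) for s in sections], [0, 1, 2], now))  # 7000 remaining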
7. ReportCenter is running on this system, and we see messages every 30 minutes. These messages continue throughout, even when the IPLOG is bad-locked.
02.02.10.87 BG-SYS NMBGBSYS WRCALL23 ReportCenter STC NMJAVA active, connected to DB2/NETMJAVA DB2 DSN12015 IBM DB2 JD
BC Universal Driver Architecture 3.72.54 jdbc:db2:TSYSD2TZ
02.02.10.88 BG-SYS NMBGBSYS WRCALL23 NMJAVA JVM Memory In Use 10,307,072 Total 192,282,624 Free 181,975,552
02.32.10.82 BG-SYS NMBGBSYS WRCALL23 ReportCenter STC NMJAVA active, connected to DB2/NETMJAVA DB2 DSN12015 IBM DB2 JD
BC Universal Driver Architecture 3.72.54 jdbc:db2:TSYSD2TZ
02.32.10.83 BG-SYS NMBGBSYS WRCALL23 NMJAVA JVM Memory In Use 14,343,112 Total 184,877,056 Free 170,533,944
CONCLUSION: ReportCenter is not related to the problem. It runs consistently during both high and low CPU times.
The large amount of data destined for the IPLOG, as indicated both by the 'burst' messages and by the vast number of records deleted during housekeeping, is what is causing the high CPU. The drop in MIPS utilization occurred when the IPLOG was BAD-LOCKED and thus no records were being processed. The cessation of the 'burst' messages from late on the 17th until the region was restarted on the 20th matches that MIPS decrease as well.
STEP 1: Determine the cause of the excessive records being written to the IPLOG
Attempting to browse the IPLOG by viewing all connections in /IPHIST produces an NDBSCAN error because the number of records in the log exceeds the searchable limit.
The technique outlined below can be used to diagnose any situation where an excessive number of records is being written to the IPLOG.
Go to /IPHIST --> Browse Connection and Event Details --> Perform Custom Search
Using $IPKEY as the field to be searched limits the number of records scanned to the timestamps defined in the VALUE field, so it is possible to see a portion of a very large IPLOG.
A search was done on all records for the current day:
D=Delete I=Insert R=Repeat
"(" Field Opr Value Gen ")" Bool
( + $IPKEY + GE 20230110000 + + AND
+ $IPKEY + LE 20230110999 + ) +
The log shows 172099 records for that date. Note that $IPKEY is in GMT while the date displayed is local time, so the display includes about 5 hours from the prior local day.
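To illustrate that offset, the short example below converts the start of a GMT day to local time. America/New_York (UTC-5 in January) is assumed from the '5 hours' mentioned above; the internal layout of $IPKEY itself is not reproduced.

# Illustration of the GMT/local offset only. The America/New_York zone is an
# assumption based on the 5-hour difference described in the text.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

gmt_day_start = datetime(2023, 1, 10, 0, 0, tzinfo=timezone.utc)
local = gmt_day_start.astimezone(ZoneInfo("America/New_York"))
print(local)   # 2023-01-09 19:00:00-05:00
# Records keyed under GMT January 10th before 05:00 GMT therefore display
# with a local date of 09-JAN-2023 on the results panel.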
NMSTR2T--------------- TCP/IP : Custom Search Results ------------------138/172099
Command ===> Scroll ===> CSR
S/B=Browse
Type Cmnd Time Date Sec? Userid RemoteIPAddress LUname
CN TERM 17:21:18 09-JAN-2023 NO DB2L 10.17.118.62
CN TERM 17:21:18 09-JAN-2023 NO DB2L 10.128.203.53
CN TERM 17:21:19 09-JAN-2023 NO DB2L 10.114.77.80
CN TERM 17:21:19 09-JAN-2023 NO DB2L 10.114.77.80
In this instance, it shows masses of TERM records for Userid DB2L on an unsecured port, often multiple connections within the same second.
On this system, unsecured connections are not allowed, so further review shows a very rapid open/close of each connection and an error message because AT-TLS is not in use.
Pressing PF11 to view additional fields shows that most records are for Jobname DBLNDIST, and pressing PF11 again shows that most are for local port 5564.
Adding the port as a criterion to the custom search
"(" Field Opr Value Gen ")" Bool
( + $IPKEY + GE 20230110000 + + AND
+ $IPKEY + LE 20230110999 + + and
+ $IPLCLPORT + EQ 5564 + ) +
brings back about 145000 records.
So approximately 84% of the records (145000 of 172099) are from this one job/port.
NOTE: the full list of field names can be searched and selected by using '?' as the 'Field' value and pressing <Enter>
The same custom search for a date two days prior, again limited to that port, shows 449232 records.
Remove the port, and there are 526665 records in total for that day. So again, roughly 85% of all records are from this one task/port.
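As a quick check, those proportions follow directly from the record counts returned by the searches above:

# Quick arithmetic check of the proportions quoted above, using the record
# counts returned by the custom searches.
current_day_port, current_day_total = 145_000, 172_099
prior_day_port, prior_day_total = 449_232, 526_665

print(f"{current_day_port / current_day_total:.0%}")   # 84%
print(f"{prior_day_port / prior_day_total:.0%}")       # 85%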
STEP 2: Update /IPAPPLS to prevent the offending application from writing to the IPLOG
1. Review the definitions in /IPAPPLS. In this case they show:
- there are several DB2 tasks.
- for the DB2 tasks, no port is defined and TERM records are being written to the IPLOG.
Appl    Appl       TCP/IP Stack  Local Port    Deliver  Collect  Expiry
Order   Name Base  Jobname       Range         Records  Stats    Period
20100   CA7        TCPIP1                      NONE     YES      0
20500   NMSTR      TCPIP1                      NONE     YES      0
23456                            12345-12345   NONE     NO       0
30932   DB2                                    TERM     YES      0
30933   TELNET                                 NONE     YES      0
30934   TELNET                                 NONE     YES      0
30935   DB2                                    TERM     YES      0
30936   CICS                                   NONE     YES      0
30937   DB2                                    TERM     YES      0
30938   DB2                                    TERM     YES      0
30939   DB2                                    NONE     YES      0
Pressing PF11 shows the actual application jobnames:
Note that the sequence numbers for the many and varied tasks are all contiguous - there are no gaps.
The sequence number for DBLNDIST is 30938.
Appl    Appl       Application  Generate
Order   Name Base  Jobname      Name
20100   CA7        CA7*         ASIS
20500   busapp-                 JOBNAME
23456                           JOBNAME
30932   DB2        DBLCDIST     ASIS
30933   TELNET     TCPTEL       ASIS
30934   TELNET     TCPTN13T     ASIS
30935   DB2        DBLRDIST     ASIS
30936   CICS       CICS11NN     ASIS
30937   DB2        DBLKDIST     ASIS
30938   DB2        DBLNDIST     ASIS
30939   DB2        DBLSDIST     ASIS
2. Create a new, more specific entry for DBLNDIST to prevent records from port 5564 from being written to the activity log.
Because /IPAPPLS works like a sieve, this new entry must sit higher in the list than the generic DBLNDIST entry so that port 5564 connections are filtered out before they reach it (a conceptual sketch of this matching follows the panel below).
This is done using the Appl Order. Because the existing entries are contiguously numbered, the new entry must be somewhere before 30932.
Selecting DBLNDIST with C (copy) creates a new definition; give it sequence number 30900, set Local Port Range to 5564, and set Deliver Records to NONE.
Application Name Entry
Order .................... 30900
Application Name Base .... DB2
Connection Match Criteria
Application Jobname ...... DBLNDIST
TCP/IP Stack Jobname .....
Local Port Range ......... 5564
Remote Address Range .....
Remote Port Range ........
Processing Options
Generate Name ............ ASIS
Deliver Records .......... NONE
Write INIT Records ....... DEFAULT
Write TERM Records ....... DEFAULT
Write zERT Records ....... DEFAULT
Collect Statistics? ...... YES
Expiry Period (minutes) 0
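As a rough illustration of the 'sieve' behavior, the sketch below tries entries in ascending Appl Order and lets the first matching entry decide how a connection's records are handled. This is a conceptual model only, not Netmaster code; the two entries use the order numbers, jobname, and port taken from the panels above.

# Conceptual model of the /IPAPPLS sieve: entries are tried in ascending
# Appl Order and the first entry whose criteria match the connection wins.
entries = [  # order, jobname criterion, local port criterion, deliver setting
    {"order": 30900, "jobname": "DBLNDIST", "port": 5564, "deliver": "NONE"},
    {"order": 30938, "jobname": "DBLNDIST", "port": None, "deliver": "TERM"},
]

def deliver_setting(jobname, port):
    for e in sorted(entries, key=lambda e: e["order"]):
        if e["jobname"] == jobname and e["port"] in (None, port):
            return e["deliver"]            # first match decides
    return "DEFAULT"

print(deliver_setting("DBLNDIST", 5564))   # NONE - matches the new 30900 entry
print(deliver_setting("DBLNDIST", 4000))   # TERM - falls through to 30938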
Initially, the CPU problem may persist somewhat because of the size of the IPLOG and the associated housekeeping overhead, but the BURST messages and the CPU used to gather records should be greatly reduced.
STEP 3: Create a new IPLOG to bring the log back to a normal size
Use IDCAMS to generate a new file. If the current IPLOG content must be kept, a physical reorg (copying the records out and back in) can be included in the process.
This can be done on a running system, or when cycling the region.
For details on how to do this, including sample JCL, please review KB 123157
In most instances, problems of this type are due to an application generating excess records.
In rare instances, the cause may be due to excessive FTP or Telnet traffic.
It is important to note whether the problematic record type is CN (connection records written by the stack and controlled in /IPAPPLS), or records written directly by the FTP client/server (FC or FS) or the Telnet server (TS).
FTP and TELNET records are controlled by /PARMS IPEVENT; they are either enabled or disabled.