IPLOG too large - browsing fails / high CPU in Netmaster

Article ID: 260462

Products

NetMaster Network Management for TCP/IP

Issue/Introduction

The capacity team is reporting periods of high CPU in NetMaster. What is causing this?

Environment

NetMaster for TCP/IP Release: 12.2

Cause

The following is one example of the symptoms. Other situations may show different messages, or there may be no obvious symptoms in the log.

Symptoms in the logs were correlated to the CPU usage graph.
1. The graph shows over 150 MIPS for all timeframes starting in early November, except where it drops abruptly on the evening of November 17th, remaining low until early on November 20th.

2. The log shows these pairings of 'burst' messages, indicating that NetMaster is protecting the region against excessive incoming events during the period between each pair of messages.

01.15.41.91  BG-SYS    NMBGBSYS   IPLO8120 IPLOG Processing changed from 'ACTIVE' to 'BURST-MODE' due to incoming  records 
                                  greater than burst protection limit of .                                                 
01.22.10.56  BG-SYS    NMBGBSYS   IPLO8121 IPLOG Processing changed from 'BURST-MODE' to 'ACTIVE'     



3. The IPLOG was BAD-LOCKED on November 17th. The IPLO8121 message below is also the last 'burst' message that occurs until the region is restarted on the morning of November 20th.

18.10.32.13  BG-SYS    NMBGBSYS   N87902 VSAM ERROR ON NDB FILE: IPLOG VFS REQUEST CODE: 16 R15: 16 FDBK: 1C   
18.10.32.13  -         -          N87V91 NDB: IPLOG KEYSTATS/RIDSCAN STOPPED AT NDB SHUTDOWN (RESTART PENDING).
18.10.32.14  BG-SYS    NMBGBSYS   N16C63 ALLOC MDO OPERAND INVALID, MDO KEY=11                                 
18.10.32.52  BG-SYS    NMBGBSYS   IPLO8121 IPLOG Processing changed from 'BURST-MODE' to 'ACTIVE'              
18.10.37.14  -         -          N87150 NDB IPLOG HAS STOPPED.     


4. When IPLOG housekeeping starts at 01:30 on the 18th, it successfully reorgs the IPLOG.
 Note that the number of records processed is over 9 million. The log should not contain anywhere near that many records; at that size it becomes unusable.

01.30.25.30  BG-SYS    NMBGBSYS   IPLO8221 IPLOG reorg: copying records to INSX00P.NETMASTR.M2.TSYS.IPLOGSEQ               
01.30.40.01  BG-SYS    NMBGBSYS   IPLO8223 IPLOG reorg:  REPRO IFILE(IPLOG) ODS(INSX00P.NETMASTR.M2.TSYS.IPLOGSEQ)         
01.30.40.01  BG-SYS    NMBGBSYS   IPLO8223 IPLOG reorg: IDC0005I NUMBER OF RECORDS PROCESSED WAS 9149333                 
01.30.40.01  BG-SYS    NMBGBSYS   IPLO8223 IPLOG reorg: IDC0001I FUNCTION COMPLETED, HIGHEST CONDITION CODE WAS 0        
01.30.40.01  BG-SYS    NMBGBSYS   IPLO8223 IPLOG reorg: IDC0002I IDCAMS PROCESSING COMPLETE. MAXIMUM CONDITION CODE WAS 0
01.30.40.01  BG-SYS    NMBGBSYS   IPLO8222 IPLOG reorg: copying records from INSX00P.NETMASTR.M2.TSYS.IPLOGSEQ           
01.30.52.41  BG-SYS    NMBGBSYS   IPLO8223 IPLOG reorg:  REPRO IDS(INSX00P.NETMASTR.M2.TSYS.IPLOGSEQ) OFILE(IPLOG) REUSE 
01.30.52.48  BG-SYS    NMBGBSYS   IPLO8223 IPLOG reorg: IDC0005I NUMBER OF RECORDS PROCESSED WAS 9149333                 
01.30.52.48  BG-SYS    NMBGBSYS   IPLO8223 IPLOG reorg: IDC0001I FUNCTION COMPLETED, HIGHEST CONDITION CODE WAS 0   

 

5. After the region is restarted on November 20th, IPLOG housekeeping - specifically the record deletion - is excessively slow.
Where the region started on November 13th deleted over a million records in an hour, the new region with the reorged IPLOG takes until 22:35 to delete half that amount.

6. The IPLOG is so large that it appears sections of record IDs are not being cleared.
    Of the 14 days' worth of logs that were sent, housekeeping did not delete any records on 6 of those days, even though it should delete everything older than 4 days.
    On the other days, the number of deleted records fluctuated between 50,000 and over a million.
    The amounts would be expected to be far more consistent if housekeeping were working properly.
    Housekeeping works by looking at a portion of the log and clearing all expired records in that section, normally working its way sequentially through the entire log. That is not happening here: some sections are being missed, while other sections under review are either already deleted or still contain valid data. As a result, the log continues to grow even though some areas are never cleared for reuse.

7. ReportCenter is running on this system, and WRCALL23 messages appear every 30 minutes. These messages continue throughout, even when the IPLOG is bad-locked.

02.02.10.87  BG-SYS    NMBGBSYS   WRCALL23 ReportCenter STC NMJAVA active, connected to DB2/NETMJAVA DB2 DSN12015 IBM DB2 JD
                                  BC Universal Driver Architecture 3.72.54 jdbc:db2:TSYSD2TZ                                
02.02.10.88  BG-SYS    NMBGBSYS   WRCALL23 NMJAVA JVM Memory In Use 10,307,072 Total 192,282,624 Free 181,975,552           
02.32.10.82  BG-SYS    NMBGBSYS   WRCALL23 ReportCenter STC NMJAVA active, connected to DB2/NETMJAVA DB2 DSN12015 IBM DB2 JD
                                  BC Universal Driver Architecture 3.72.54 jdbc:db2:TSYSD2TZ                                
02.32.10.83  BG-SYS    NMBGBSYS   WRCALL23 NMJAVA JVM Memory In Use 14,343,112 Total 184,877,056 Free 170,533,944           

     
     
CONCLUSION: ReportCenter is not related to the problem; it runs consistently during both high and low CPU periods.
The high CPU is caused by the large volume of data destined for the IPLOG, as indicated both by the 'burst' messages and by the vast number of records being deleted during housekeeping. The drop in MIPS utilization occurred when the IPLOG was BAD-LOCKED, so no records were being processed. The cessation of the 'burst' messages from late on the 17th until the region was restarted on the 20th matches that MIPS decrease as well.

Resolution

STEP 1: Determine the cause of the excessive records being written to the IPLOG

Attempting to browse the IPLOG by viewing all connections in /IPHIST fails with an NDBSCAN error because the number of records in the log exceeds the searchable limit.
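
As a quick check from outside NetMaster, the overall size of the IPLOG can be confirmed in batch with a standard IDCAMS LISTCAT. This is a generic sketch only; the cluster name below is an assumption based on the dataset names in this example and must be replaced with your site's actual IPLOG dataset name.

//LISTIPL  JOB ...
//LIST     EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  /* Replace with your site's IPLOG cluster name */
  LISTCAT ENTRIES(INSX00P.NETMASTR.M2.TSYS.IPLOG) ALL
/*

In the STATISTICS section of the DATA component, REC-TOTAL shows the current number of records and HI-USED-RBA shows the space in use.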

The technique outlined below can be used to diagnose any situation where an excess of records is being written to the IPLOG.

Go to /IPHIST --> Browse Connection and Event Details --> Perform Custom Search 

Using $IPKEY as the field to be searched limits the number of records scanned to the timestamps defined in the VALUE field, so it is possible to view a portion of a very large IPLOG.
The following search covers all records for the current day.

                                                 D=Delete I=Insert R=Repeat
"("   Field          Opr  Value                              Gen ")"   Bool
 (  + $IPKEY       + GE   20230110000                      +         + AND
    + $IPKEY       + LE   20230110999                      +      )  +    


The search returns 172,099 records for that date. Of note is that $IPKEY is GMT while the date displayed is local, so the display includes 5 hours from the prior (local) day.

NMSTR2T--------------- TCP/IP : Custom Search Results ------------------138/172099
Command ===>                                                   Scroll ===> CSR 
                                                                               
                                                                     S/B=Browse
    Type Cmnd Time     Date        Sec? Userid   RemoteIPAddress  LUname       
    CN   TERM 17:21:18  09-JAN-2023 NO   DB2L     10.17.118.62
    CN   TERM 17:21:18  09-JAN-2023 NO   DB2L     10.128.203.53
    CN   TERM 17:21:19  09-JAN-2023 NO   DB2L     10.114.77.80
    CN   TERM 17:21:19  09-JAN-2023 NO   DB2L     10.114.77.80

In this instance, the results show masses of TERM records for Userid DB2L on an unsecured port, often with multiple connections in the same second.
On this system, unsecured connections are not allowed, so further review shows a very rapid open/close of each connection and an error message because there is no AT-TLS.

Pressing PF11 to view additional fields shows that the records are for Jobname DBLNDIST, and most are for local port 5564.
Adding the port as a criterion in the custom search

"("   Field          Opr  Value                              Gen ")"   Bool
 (  + $IPKEY       + GE   20230110000                      +         + AND
    + $IPKEY       + LE   20230110999                      +         + AND
    + $IPLCLPORT   + EQ   5564                             +      )  +

brings back about 145,000 records.
So approximately 84% of the records (145,000 of 172,099) are from this one job/port.

NOTE: The full list of field names can be searched and selected by using '?' as the 'Field' value and pressing <Enter>.


The same custom search with a date of two days prior, again limited to that port, shows 449,232 records.
Removing the port gives 526,665 records in total for that day, so again roughly 85% of all records are from this one task/port.

 

STEP 2: Update /IPAPPLS to prevent the offending application from writing to the IPLOG

1. Review the definition in /IPAPPLS. In this case it shows
    - there are several DB2 tasks.
    - for all tasks, no port is defined and TERM records are being written to the IPLOG.

Appl   Appl           TCP/IP Stack   Local  Port   Deliver  Collect  Expiry
Order  Name Base      Jobname           Range      Records  Stats    Period
20100  CA7            TCPIP1                       NONE     YES      0
20500  NMSTR          TCPIP1                       NONE     YES      0
23456                                12345-12345   NONE     NO       0
30932  DB2                                         TERM     YES      0
30933  TELNET                                      NONE     YES      0
30934  TELNET                                      NONE     YES      0
30935  DB2                                         TERM     YES      0
30936  CICS                                        NONE     YES      0
30937  DB2                                         TERM     YES      0
30938  DB2                                         TERM     YES      0
30939  DB2                                         NONE     YES      0

Pressing PF11 shows the actual application jobname.

Note that the sequence numbers for the many and varied tasks are all contiguous - there are no gaps.
DBLNDIST's sequence number is 30938.

Appl   Appl           Application    Generate
Order  Name Base      Jobname        Name    
20100  CA7            CA7*           ASIS    
20500  busapp-                       JOBNAME 
23456                                JOBNAME 
30932  DB2            DBLCDIST       ASIS  
30933  TELNET         TCPTEL         ASIS    
30934  TELNET         TCPTN13T       ASIS  
30935  DB2            DBLRDIST       ASIS  
30936  CICS           CICS11NN       ASIS    
30937  DB2            DBLKDIST       ASIS  
30938  DB2            DBLNDIST       ASIS  
30939  DB2            DBLSDIST       ASIS  

 

2. Create a new, more specific entry for DBLNDIST to prevent records from port 5564 from being written to the activity log.

Because /IPAPPLS works like a sieve, this new entry must be higher in the list than the existing DBLNDIST entry so that port 5564 traffic is filtered out first.
This is done using the Appl Order. Because several entries have contiguous numbering, the new entry must be placed somewhere before 30932.

Selecting DBLNDIST with C created a new definition, which was given sequence number 30900, Local Port 5564, and Deliver Records set to NONE.

Application Name Entry
  Order .................... 30900
  Application Name Base .... DB2

Connection Match Criteria
  Application Jobname ...... DBLNDIST
  TCP/IP Stack Jobname .....
  Local Port Range ......... 5564
  Remote Address Range .....
  Remote Port Range ........

Processing Options
  Generate Name ............ ASIS
  Deliver Records .......... NONE
  Write INIT Records ....... DEFAULT
  Write TERM Records ....... DEFAULT
  Write zERT Records ....... DEFAULT
  Collect Statistics? ...... YES
  Expiry Period (minutes)    0

 

Initially, the CPU problem may persist somewhat due to the size of the IPLOG and the overhead associated with housekeeping, but the BURST messages and the CPU spent gathering records should be much reduced.

 

STEP 3: Create a new IPLOG to return the IPLOG to a normal size.

Use IDCAMS to generate a new file. If the current IPLOG content is needed, a physical reorg can be part of the process.
This can be done on a running system or when cycling the region.

For details on how to do this, including sample JCL, please review KB 123157.
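
As a rough sketch only (the supported procedure and sample JCL are in KB 123157), a replacement cluster can be defined with IDCAMS using MODEL to copy the key, record size, and control interval attributes from the existing file. The dataset names and space allocation below are assumptions and must be adjusted for your site.

//DEFIPLOG JOB ...
//DEFNEW   EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  /* Example names only - MODEL copies attributes from the    */
  /* existing IPLOG; size the new file for normal retention.  */
  DEFINE CLUSTER (                                 -
         NAME(INSX00P.NETMASTR.M2.TSYS.IPLOG.NEW)  -
         MODEL(INSX00P.NETMASTR.M2.TSYS.IPLOG)     -
         CYLINDERS(100 50)                         -
         REUSE )
  /* Optional: preserve the current content */
  REPRO IDS(INSX00P.NETMASTR.M2.TSYS.IPLOG)        -
        ODS(INSX00P.NETMASTR.M2.TSYS.IPLOG.NEW)
/*

If the old content is not needed, omit the REPRO step and start with an empty file.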

Additional Information

In most instances, problems of this type are due to an application generating excess records.
In rare instances, the cause may be excessive FTP or Telnet traffic.

It is important to note whether the problematic record type is CN (connection records, written by the stack and controlled in /IPAPPLS), records written directly by the FTP client/server (FC or FS), or records written by the Telnet server (TS).
FTP and Telnet records are controlled by /PARMS IPEVENT; they are either enabled or disabled.