TIM stopped due to Kernel panic not syncing: OOM and no killable process.

Article ID: 48712

Updated On:

Products

APP PERF MANAGEMENT
CA Application Performance Management Agent (APM / Wily / Introscope)
CUSTOMER EXPERIENCE MANAGER
INTROSCOPE

Issue/Introduction

Description:

The MaxMemory value (for example, 700MB) configured in watchdog.xml limits only Tim's data segment/heap size. It does not limit the total process memory size, which also includes other segments such as text and stack.

In summary, Tim checks its heap size every second by using the sbrk(0) system call, and timwatcher kills Tim when the heap size crosses the configured limit. The heap size is logged in the protocolstats log file in the format "mem: XX.XX" (in MB). On the TIM side there is no restriction other than the data segment/heap size (i.e. MaxMemory: 700MB), so the Tim process will continue to run until virtual memory is exhausted.
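The logged heap figures can be compared against the configured limit with a short script. The following is a minimal sketch in Python and is not part of TIM. Only the "mem: XX.XX" fragment is taken from the description above; the sample log lines and the function names are hypothetical, and 700 MB is just the example MaxMemory value.

```python
import re

# Matches the documented "mem: XX.XX" fragment. Nothing else about the
# protocolstats line layout is assumed.
MEM_PATTERN = re.compile(r"mem:\s*(\d+(?:\.\d+)?)")


def heap_sizes_mb(lines):
    """Return every heap size (in MB) found in the given log lines."""
    sizes = []
    for line in lines:
        match = MEM_PATTERN.search(line)
        if match:
            sizes.append(float(match.group(1)))
    return sizes


def over_limit(lines, max_memory_mb=700.0):
    """Return the heap readings that exceed the configured limit."""
    return [size for size in heap_sizes_mb(lines) if size > max_memory_mb]


# Hypothetical log lines, for illustration only.
sample = [
    "stats ... mem: 512.25",
    "stats ... mem: 698.90",
    "stats ... mem: 701.40",
    "a line without a memory figure",
]
print(over_limit(sample))
```

Note that this only looks at the heap figure. As described above, a reading below the limit does not mean the machine as a whole has memory to spare.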

In the case where there is a lot of out-of-order TCP Bytes in the queue (which is possibly due to dropped packets), the free memory reported by "free -lm" is greatly reduced even though there are no processes using huge amounts of memory. TCP uses sequence numbers and acknowledgements to handle packets that for whatever reason are not successfully delivered to the recipient. If a packet is dropped before it reaches the switch that Tim is connected to, the intended recipient of the packet will not receive it and so it will not send an acknowledgement, causing the sender to resend it after some timeout. Tim will notice that any following packets are out of order, and will hold them in its out-of-order queue until the missing data is received, at which point it will put the data in the right order and process it. (This is a slight oversimplification. Data is acknowledged by byte range, not by packet, but Tim handles this correctly even if the resent data is divided into packets differently than the original data.)

On the other hand, if the operating system on the Tim machine drops packets because Tim cannot service them fast enough, then only Tim is missing the data, not the intended recipient, so the data will never be resent. Tim's memory use and CPU use will increase as it saves more data and spends more time trying to match new traffic against expected retransmissions.
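The difference between the two cases can be illustrated with a toy reassembly buffer. This is a simplified sketch in Python and not Tim's actual implementation: the class name, method names, and byte offsets are assumptions made for illustration. It holds out-of-order data by byte offset, releases it once the gap is filled (even if the resent data is split differently), and keeps growing when the gap is never filled.

```python
class ReassemblyBuffer:
    """Toy one-direction TCP byte-stream reassembler (illustrative only)."""

    def __init__(self, next_seq=0):
        self.next_seq = next_seq  # next byte offset expected in order
        self.pending = {}         # start offset -> bytes held out of order

    def receive(self, seq, data):
        """Accept a segment and return any bytes now deliverable in order."""
        if seq + len(data) <= self.next_seq:
            return b""            # entirely old data, e.g. a duplicate resend
        if seq < self.next_seq:
            data = data[self.next_seq - seq:]  # trim the part already delivered
            seq = self.next_seq
        # If two segments start at the same offset, keep the longer one.
        if len(data) > len(self.pending.get(seq, b"")):
            self.pending[seq] = data
        return self._drain()

    def _drain(self):
        delivered = bytearray()
        while self.pending:
            start = min(self.pending)
            if start > self.next_seq:
                break             # gap: everything later keeps waiting
            chunk = self.pending.pop(start)
            usable = chunk[self.next_seq - start:]
            delivered += usable
            self.next_seq += len(usable)
        return bytes(delivered)

    def held_bytes(self):
        """Total bytes currently parked in the out-of-order queue."""
        return sum(len(chunk) for chunk in self.pending.values())


# Case 1: bytes 4..7 were dropped in the network and are later resent
# as two smaller segments. The queue empties once the gap is filled.
resent = ReassemblyBuffer()
resent.receive(0, b"AAAA")
resent.receive(8, b"CCCC")
resent.receive(12, b"DDDD")
resent.receive(4, b"BB")
print(resent.receive(6, b"BB"), resent.held_bytes())

# Case 2: the first 1000 bytes were dropped on the monitoring machine,
# so they are never resent and the held data only grows.
never_resent = ReassemblyBuffer()
for i in range(1, 4):
    never_resent.receive(i * 1000, b"x" * 1000)
    print(never_resent.held_bytes())
```

In the second case nothing is released until the connection is discarded, which is why a shorter connection timeout limits how much memory such connections can hold.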

Also, the OS itself reserves some memory for the kernel, context switching, caching, etc., and this memory is not attributed to any process or reported by the ps command. For more details, see the following articles:
http://slashzeroconf.wordpress.com/2008/06/12/profile-memory-in-a-linux-system/
http://virtualthreads.blogspot.com/2006/02/understanding-memory-usage-on-linux.html
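As a rough illustration of memory that no process listing accounts for, the sketch below parses text in the /proc/meminfo format and adds up a few kernel-side fields. It is a minimal example in Python under stated assumptions: the function names are hypothetical, the sample values are made up for illustration, and Buffers, Cached, and Slab are only some of the fields involved.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def kernel_side_kb(fields):
    """Sum buffers, page cache, and slab memory held by the kernel (kB)."""
    return sum(fields.get(key, 0) for key in ("Buffers", "Cached", "Slab"))


# Made-up values for illustration. On a Linux system the text would come
# from reading the file /proc/meminfo instead.
SAMPLE = """MemTotal:        8000000 kB
MemFree:          500000 kB
Buffers:          200000 kB
Cached:          3000000 kB
Slab:             300000 kB
"""

fields = parse_meminfo(SAMPLE)
print(kernel_side_kb(fields), "kB held by the kernel for buffers, cache, and slab")
```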

Solution:

Reducing the component timeout to 60 seconds reduces the likelihood of the TCP out-of-order bytes problem and, in turn, the number of potentially open connections. This reduces TIM memory usage and helps prevent it from going OOM. The component timeout is set with the ConnectionTimeoutInSeconds property, found in TIM System Setup > Configure Tim Settings.

Environment

Release:
Component: APMCM