Partner Gateway kernel hung problem

Article ID: 385263

Products

VMware VeloCloud SD-WAN

Issue/Introduction

The gateway goes down because it enters a hung state in which all processes are stuck. The gateway recovers after a reboot.

Environment

This is seen on Partner Gateways: the gateway is reported as down, and on logging in to the console a kernel hung-task message referencing "/proc/sys/kernel/hung_task_timeout_secs" is displayed.

Cause

The issue is caused by slow I/O (ref: https://thelinuxcluster.com/2023/08/24/having-kernel-hung_task_timeout_secs-issues/):
By default, Linux uses up to 40% of the available memory for file system caching. Once this mark is reached, the file system flushes all outstanding data to disk, causing all subsequent I/O to become synchronous. There is a default time limit of 120 seconds for flushing this data to disk. In this case, the I/O subsystem is not fast enough to flush the data within 120 seconds. As the I/O subsystem responds slowly and more requests are served, system memory fills up, resulting in the above error.
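
As a rough illustration of the arithmetic behind this explanation, the sketch below estimates how long it would take to write out a file system cache that has reached its limit, and compares the result with the 120-second limit. The function name, the memory size, and the throughput figures are hypothetical values chosen for illustration, not measurements from a gateway, and the model is deliberately simplified: it assumes the whole cache is flushed at a constant sustained rate.

```python
HUNG_TASK_TIMEOUT_SECS = 120  # default time limit mentioned in the text

GIB = 1024 ** 3
MIB = 1024 ** 2


def estimated_flush_seconds(available_mem_bytes, cache_percent, write_bytes_per_sec):
    """Rough time needed to write out a cache that has reached its limit.

    Simplified model: cache size = available memory * cache_percent / 100,
    flushed at a constant write throughput.
    """
    if write_bytes_per_sec <= 0:
        raise ValueError("write throughput must be positive")
    cache_bytes = available_mem_bytes * cache_percent / 100
    return cache_bytes / write_bytes_per_sec


# Hypothetical host: 32 GiB of available memory and a 40% cache limit,
# evaluated against a slow and a fast storage backend.
slow = estimated_flush_seconds(32 * GIB, 40, 50 * MIB)
fast = estimated_flush_seconds(32 * GIB, 40, 500 * MIB)

print(f"slow disk: {slow:.0f}s, exceeds limit: {slow > HUNG_TASK_TIMEOUT_SECS}")
print(f"fast disk: {fast:.0f}s, exceeds limit: {fast > HUNG_TASK_TIMEOUT_SECS}")
```

Under these assumed numbers, only the slow backend would need longer than the limit to drain the cache, which matches the slow I/O condition described above.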

Resolution

Ideally, dirty_background_ratio (the threshold for asynchronous writeback, during which the application can continue while the kernel flushes dirty pages to disk) should be lower than dirty_ratio (the threshold for synchronous writeback, which blocks the processes that are writing). When these values are equal, the system may not differentiate between when to flush in the background and when to block.
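
The relationship between the two thresholds can be checked with a short script. The following is a minimal sketch, not a VMware-provided tool: it assumes a Linux host that exposes the standard /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio files, and the helper names read_vm_percent and check_dirty_ratios are hypothetical.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")  # standard location of the Linux vm.* tunables


def read_vm_percent(name, base=VM_DIR):
    """Read an integer tunable such as dirty_ratio from /proc/sys/vm."""
    return int((base / name).read_text().strip())


def check_dirty_ratios(background_ratio, dirty_ratio):
    """Return a list of warnings about the two writeback thresholds."""
    warnings = []
    if background_ratio == 0 or dirty_ratio == 0:
        # A ratio reads as 0 when the corresponding *_bytes tunable is set instead.
        warnings.append("a ratio is 0: check the dirty_bytes / dirty_background_bytes settings")
    elif background_ratio >= dirty_ratio:
        warnings.append(
            f"dirty_background_ratio ({background_ratio}) should be lower "
            f"than dirty_ratio ({dirty_ratio})"
        )
    return warnings


if __name__ == "__main__":
    try:
        background = read_vm_percent("dirty_background_ratio")
        blocking = read_vm_percent("dirty_ratio")
    except OSError as exc:
        print(f"could not read writeback settings: {exc}")
    else:
        for line in check_dirty_ratios(background, blocking) or ["thresholds look consistent"]:
            print(line)
```

The check is kept in a pure function so it can be applied to values collected from any gateway, for example from saved sysctl output, without running on the gateway itself.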