Node instability due to containerd v1.6.6 memory leak

Article ID: 438917

Updated On:

Products

VMware Telco Cloud Automation

Issue/Introduction

  • Kubernetes nodes within a Telco Cloud Automation (TCA) environment experience progressive memory exhaustion.
  • Affected nodes eventually reach critical memory utilization (e.g., >95%), leading to node instability, crashes, or reboots.
  • The containerd process shows a significantly high Resident Set Size (RSS), sometimes reaching over 100 GB.
  • System uptime is typically high (e.g., 200+ days).
  • Applications running on the nodes may fail or be evicted due to Out-Of-Memory (OOM) conditions on the host.
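The symptoms above can be checked directly on a suspect node. The following is a minimal sketch, assuming a Linux node with procps `ps` and `awk` available; the helper names `containerd_rss_kib` and `kib_to_gib` are hypothetical and are not part of any TCA tooling.

```shell
#!/bin/sh
# Diagnostic sketch: run on the suspect node itself (for example over SSH).

# Sum the RSS of every process named exactly "containerd", in KiB.
# -C matches the command name, so containerd-shim processes are not counted.
containerd_rss_kib() {
    ps -o rss= -C containerd 2>/dev/null | awk '{ sum += $1 } END { print sum + 0 }'
}

# Convert KiB to GiB with one decimal place.
kib_to_gib() {
    awk -v kib="$1" 'BEGIN { printf "%.1f\n", kib / 1048576 }'
}

# Report the installed containerd version, if the binary is on PATH.
if command -v containerd >/dev/null 2>&1; then
    containerd --version
fi

# Show how long the node has been up, then the daemon's memory footprint.
uptime
echo "containerd RSS: $(kib_to_gib "$(containerd_rss_kib)") GiB"
```

A node that matches this article would typically report containerd v1.6.6, a long uptime, and an RSS figure far above what the other nodes in the cluster show.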

Environment

VMware Telco Cloud Automation 3.2

Cause

The issue is caused by a known memory leak in the containerd daemon, version 1.6.6. The leak occurs in the Container Runtime Interface (CRI) and the Task Service. Over long periods of uptime, the daemon's Go runtime heap keeps growing because allocated memory pages are retained and never reclaimed.

Resolution

Temporary Mitigation

To immediately reclaim memory and restore node stability without a full upgrade, restart the containerd service on the affected node:

systemctl restart containerd

Note: Make sure that the node is cordoned and drained before restarting the service.
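The cordon, drain, and restart sequence described above can be sketched as follows. This is an illustration under assumptions, not a vendor procedure: it assumes `kubectl` and `systemctl` are both usable from the same shell, the drain flags (`--ignore-daemonsets`, `--delete-emptydir-data`) are a common choice that may not suit every workload, and the final uncordon step is added here to return the node to service. The helper names `run` and `restart_containerd_on_node` are hypothetical.

```shell
#!/bin/sh
# Sketch: print (DRY_RUN=1, the default) or execute the mitigation sequence.
# If kubectl and systemctl are not available in the same place, run the
# printed commands by hand where each one belongs.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"      # dry run: only show the command
    else
        "$@"             # real run: execute it
    fi
}

restart_containerd_on_node() {
    node="$1"
    # Stop at the first failing step so containerd is never restarted
    # on a node that could not be drained.
    run kubectl cordon "$node" &&
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
    run systemctl restart containerd &&
    run kubectl uncordon "$node"
}
```

Calling `restart_containerd_on_node worker-1` (a placeholder node name) prints the four commands for review; setting `DRY_RUN=0` executes them.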

Permanent Resolution

Upgrade TCA to version 3.3.0.1 with Kubernetes 1.30.

The containerd version is tied to the BYOI template and the TKG VM. TCA 3.3.0.1 introduces new BYOI templates that ship containerd 1.7.29, which resolves the issue.
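After the upgrade, the runtime version that each node reports can be checked from any host with cluster access. The sketch below assumes the kubelet reports the runtime in the usual `containerd://<version>` form, possibly with a vendor suffix; the function names `nodes_on_leaky_containerd` and `check_cluster` are hypothetical.

```shell
#!/bin/sh
# Print the names of nodes whose reported runtime is still containerd 1.6.6.
# Reads "NODE RUNTIME" lines on stdin, e.g. "node-a containerd://1.6.6".
nodes_on_leaky_containerd() {
    awk '$2 ~ /^containerd:\/\/v?1\.6\.6([^0-9]|$)/ { print $1 }'
}

# Query the cluster and filter the result. Requires kubectl and a kubeconfig.
check_cluster() {
    kubectl get nodes --no-headers \
        -o custom-columns=NODE:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion |
        nodes_on_leaky_containerd
}
```

Running `check_cluster` prints nothing once every node has moved off containerd 1.6.6; any node name it prints is still on the affected version.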

 

Additional Information

Supporting documentation: containerd version v1.7.29