Supervisor Control Plane VMs Experience Memory Leak caused by rsyslog

Article ID: 313948


Products

VMware vCenter Server

Issue/Introduction

This article describes how to return the Supervisor cluster to working status when a memory leak caused by rsyslog affects the Supervisor Control Plane VMs.

Symptoms:
  • The vCenter housing the Supervisor Cluster uses log forwarding to either vRealize/Aria or a third-party log server
  • The Supervisor Cluster does not show Running status and is instead shown as Configuring, with messaging similar to:

Configure operation for the Master node VM with identifier vm-XXXX is pending
 
  • The overview section for the Supervisor Cluster shows that one or more SupervisorControlPlaneVMs are "Not ready"
  • Kubernetes status shows some or many deployments in vmware-system namespaces as not running, including but not limited to:
    • vmware-system-nsop-controller-manager
    • capi-kubeadm-bootstrap-controller-manager
    • vmware-system-vmop-controller-manager
  • The Monitor section for the VM object in vSphere/ESXi shows a significant increase in disk usage with no corresponding environmental change
  • Accessing the Supervisor control plane nodes via SSH and running the top command shows memory usage stuck at 100%, with rsyslog consuming the majority of memory (>60%)
  • Investigating the VM console in vSphere/ESXi shows messaging similar to:
 
Out of memory: Kill process XXXXXX (python3) score YYYY or sacrifice child
Killed process XXXXXX (python3) total-vm:123456kB
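Once connected to an affected control plane node over SSH, the memory symptom above can be checked directly. A minimal sketch (the >60% figure comes from the symptom list; rsyslogd is the daemon's process name on Photon OS):

```shell
# Show rsyslog's memory footprint; %mem well above 60 matches this issue.
ps -o pid,%mem,rss,comm -C rsyslogd

# Confirm overall memory pressure on the node.
free -m
```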


Cause

This is a bug in the rsyslog version included in the RPMs used to deploy Supervisor Control Plane VMs with vCenter 8.0.0 and 8.0.1. More information on the bug can be found in the rsyslog GitHub issue linked under Additional Information.

Resolution

Engineering has confirmed this will be resolved when the Supervisor Control Plane VM RPMs are upgraded to Photon OS version 4.0. The release date is not currently known. Subscribe to this KB article to receive an email notification when it is updated.

Workaround:
Restarting the rsyslog service has been shown to be an effective workaround for this issue. To do so:
  1. SSH to the vCenter Server Appliance as the root user
  2. Run the command: /usr/lib/vmware-wcp/decryptK8Pwd.py
  3. Using the management IP (eth0) of each affected Supervisor control plane VM, SSH to it with the plaintext password printed by the previous command
  4. Once connected to the VM, run the command: systemctl restart rsyslog
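The steps above can be combined into a short session run from the vCenter Server Appliance shell. This is a sketch, not an official tool: SUPERVISOR_IP is a placeholder for each affected node's eth0 management IP, and the SSH password prompt expects the plaintext value printed by decryptK8Pwd.py.

```shell
#!/bin/sh
# Sketch: restart rsyslog on one affected Supervisor control plane VM.
# SUPERVISOR_IP is a placeholder -- substitute each node's eth0 management IP.
SUPERVISOR_IP="192.0.2.10"

# 1. Print the Supervisor node IPs and plaintext password
#    (run on the vCenter Server Appliance as root).
/usr/lib/vmware-wcp/decryptK8Pwd.py

# 2. SSH to the node using that password, restart rsyslog,
#    and confirm the service came back up.
ssh "root@${SUPERVISOR_IP}" 'systemctl restart rsyslog && systemctl is-active rsyslog'
```

Repeat for each Supervisor control plane VM that is "Not ready"; memory use should drop immediately after the restart.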


Additional Information

rsyslog GitHub issue: https://github.com/rsyslog/rsyslog/issues/5135
rsyslog ChangeLog: https://github.com/rsyslog/rsyslog/blob/v8.2306.0/ChangeLog

Impact/Risks:
The Supervisor cluster will exhibit intermittent or total outages; workload clusters are unaffected.