Controller node status is Not Ready


Article ID: 321745


Updated On: 10-17-2024

Products

VMware Integrated OpenStack

Issue/Introduction

  • Unable to ping the VIO controller node IP.
  • Pods go into a Pending state. The status of the VIO Management Server or controller node changes to NotReady.
  • Possible network congestion, packet drops, or a disconnect.
  • The kubelet log contains "use of closed network connection" entries, for example:
Jun 13 18:14:40 controller-XXXXXXXXXXX kubelet[533]: I0613 18:14:01.455550     533 trace.go:116] Trace[999980923]: "Reflector ListAndWatch" name:object-"openstack"/"nova-etc" (started: 2022-06-13 18:13:46.325761188 +0000 UTC m=+13284809.189135873) (total time: 15.129778023s):
Jun 13 18:14:40 controller-XXXXXXXXX kubelet[533]: Trace[999980923]: [15.129778023s] [15.129778023s] END
Jun 13 18:14:40 controller-XXXXXXXXX kubelet[533]: E0613 18:14:01.455555     533 reflector.go:153] object-"openstack"/"nova-etc": Failed to list *v1.Secret: Get https://xxx.xxx.xxx.xxx:6443/api/v1/namespaces/openstack/secrets?fieldSelector=metadata.name%3Dnova-etc&limit=500&resourceVersion=0: read tcp xxx.xxx.xxx.xxx:20930->xxx.xxx.xxx.xxx:6443: use of closed network connection


Note: The above error can appear for any of the OpenStack services, such as Nova, Glance, Heat, RabbitMQ, Horizon, and so on.
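
To confirm the symptom on a suspect controller node, the kubelet journal can be searched for the error string. A minimal check (the time window and the number of matches will vary per environment):
#journalctl -u kubelet --since "1 hour ago" | grep -c "use of closed network connection"
A non-zero count indicates the controller has hit this condition recently.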


Environment

VMware Integrated OpenStack 7.x

Cause

Network congestion, packet drops, or a disconnect (caused by hardware, a soft reset triggered by drivers, or slow-drain devices) can bring the network down on the controller node. The controller does not recover on its own after the network is restored; the kubelet service must be restarted manually to bring the controller IP back online.

Resolution

This is a known issue affecting VMware Integrated OpenStack 7.x.

Workaround:
  • Manually restart the kubelet service on the affected controller node:
#viossh <controller node with problem>
#systemctl restart kubelet
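
After the restart, the node should return to Ready within a few minutes. One way to verify this (assuming kubectl is run from the VIO Management Server, where the cluster kubeconfig is available, and the second command is run on the controller itself):
#kubectl get nodes
#systemctl status kubelet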

A cron job can be created to check the kubelet logs and restart the service automatically.
  1. Create a script with any desired name and a .sh extension, with the following contents:
#!/bin/bash
# Check the last 10 kubelet journal lines for the connection error
output=$(journalctl -u kubelet -n 10 | grep "use of closed network connection")
if [[ $? != 0 ]]; then
  # grep returned non-zero: the error string was not found
  echo "Error not found in logs"
elif [[ -n "$output" ]]; then
  # The error is present in the most recent log lines: restart kubelet
  echo "Restart kubelet"
  systemctl restart kubelet
fi

Note: In the script above, -n specifies the number of journal lines to retrieve. At the time of the issue there will be many matching entries, which causes the kubelet service to be restarted. Because only the most recent lines are checked, the error no longer appears after the restart, so the kubelet service is not restarted again on every run even though older errors remain further back in the logs.
  2. Make the shell script executable:
#chmod 777 filename.sh
  3. Run crontab -l to check the existing entries, then create a cron job to run the script every 5 minutes (or less frequently if preferred):
#crontab -e
Add the following line:
*/5 * * * * /path_to_script/filename.sh >> /path_for_output_file/xxx.txt 2>&1
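
After saving the crontab, the new entry and the script output can be checked; for example (reusing the placeholder output path from the cron line above):
#crontab -l
#tail -f /path_for_output_file/xxx.txt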

In some instances the cron service will not be installed. It can be installed with:
#tdnf install -y cronie
#systemctl enable --now crond.service
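
Once enabled, it is worth confirming that the cron daemon is active before relying on the scheduled job; for example:
#systemctl status crond.service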

If the controllers do not have internet access, download the cronie RPM package from the link below and manually transfer it to the controller: https://packages.vmware.com/photon/3.0/photon_release_3.0_x86_64/x86_64/cronie-1.5.1-1.ph3.x86_64.rpm

To install:
#rpm --install cronie-1.5.1-1.ph3.x86_64.rpm --noscripts
#systemctl start crond
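
After the offline installation, the same checks apply; for example:
#rpm -q cronie
#systemctl status crond.service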

Additional Information

Impact/Risks:
Pods go into a Pending state. The status of the VIO Management Server or controller node changes to NotReady.