Redis pods failing to come up


Article ID: 404783


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Redis pods fail to come up after an upgrade.

This issue can affect any pods or workloads that are sensitive to time synchronization.

Environment

The issue was seen in an environment where Redis is running on TKGi.

Cause

This issue can be caused by UDP or TCP ports to the NTP server being blocked. Time drift is the symptom that indicates this is the cause. When ports 123 or 1023 are blocked from communicating with the NTP server, time can drift between the worker nodes and the pods.

Redis checks time synchronization between nodes during startup, and the time difference can eventually cause the Redis failure. Error messages similar to the following may be found:

2025-07-18 18:28:24,187 INFO bootstrap MainThread: Sending a new node join request to the master [email protected], validate_only: False
2025-07-18 18:28:24,205 INFO bootstrap MainThread: Node join response received. Status code: 406
2025-07-18 18:28:24,205 INFO bootstrap MainThread: Node join response error code: time_not_sync
2025-07-18 18:28:24,205 WARNING bootstrap_mgr MainThread: Bootstrap failed: [time_not_sync][System time is not synchronized]
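
As a quick check for the time drift described above, compare the clocks reported inside the affected Redis pods, which normally run on different worker nodes. This is only a sketch; the pod names are examples and may differ in your environment:

$ kubectl exec redis-0 -- date -u
$ kubectl exec redis-1 -- date -u
$ kubectl exec redis-2 -- date -u

A noticeable difference between the timestamps indicates that the worker nodes are not in sync.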

 

On the Kubernetes side, you can observe symptoms similar to the following:

One or more Redis pods will show restarts.

$ kubectl get pod
NAME                                       READY   STATUS        RESTARTS        AGE    
redis-0                                    1/2     Running       1 (3m30s ago)   9m40s   
redis-1                                    2/2     Terminating   2 (44m ago)     57m     
redis-2                                    2/2     Running       0               59m     
redis-services-rigger                      1/1     Running       0               124m    
redis-enterprise                           2/2     Running       0               57m     

When describing the pods that show restarts, the Events section contains entries similar to the following:

Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Created                 11m                   kubelet                  Created container bootstrapper
  Normal   Started                 11m                   kubelet                  Started container bootstrapper
  Warning  Unhealthy               11m                   kubelet                  Readiness probe failed: /opt/redislabs/bin/python3: can't open file '/opt/redislabs/shared/health_check.py': [Errno 2] No such file or directory
  Warning  Unhealthy               6m24s (x35 over 11m)  kubelet                  Readiness probe failed: node id file does not exist - pod is not yet bootstrapped
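
The events above can be retrieved by describing an affected pod, for example (the pod name is illustrative):

$ kubectl describe pod redis-0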

 

Checking chronyc, you can also see output similar to the following:

A "System clock synchronized" value of "no" indicates that the clock is not in sync.

A Reach value of 0 means that chronyd did not get any valid responses from the NTP server. For more details on the Reach value, please review the official Chrony documentation.

The documentation indicates that this may be related to firewall port blocking.
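
As a sketch, these values can be checked on a worker node with commands such as the following. Depending on the tooling, the synchronization state is reported by timedatectl and the per-source Reach value by chronyc sources; the server address and output values below are illustrative:

$ timedatectl | grep "System clock synchronized"
System clock synchronized: no

$ chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? 10.0.0.1                      0  10     0     -     +0ns[   +0ns] +/-    0ns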

Resolution

Review firewall rules that manage traffic to the NTP server and verify that ports 123 and 1023 are open for UDP and TCP traffic.
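
As a sketch of how to verify connectivity from a worker node after adjusting the firewall rules (the NTP server address is a placeholder, and the nc utility may not be present on every node):

# Test the UDP NTP port; UDP is connectionless, so a lack of errors is only a hint:
$ nc -vzu <ntp-server> 123

# Ask chronyd to poll its sources again, then confirm the Reach value is no longer 0:
$ sudo chronyc burst 4/4
$ chronyc sources

Once the worker node clocks are back in sync, the Redis pods that failed with time_not_sync should be able to bootstrap again.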