Worker nodes time drift with configured and accessible NTP servers

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Clusters are experiencing time drift.

A time difference was discovered between the worker node time and the system time. The difference appears as small deviations of 200 - 500 milliseconds and corrects itself after some time.

In /var/log, running cat syslog | grep chronyd shows the following messages:

daemon.log:2026-02-25T09:52:15.213648+00:00 <ID> chronyd[1042]: Selected source 10.xx.xx.97
daemon.log:2026-02-25T11:52:41.041068+00:00 <ID> chronyd[1042]: Can't synchronise: no majority
daemon.log:2026-02-25T12:27:22.316887+00:00 <ID> chronyd[1042]: Selected source 10.xx.xx.129
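To get a quick feel for how often the problem occurs, the chronyd lines can be counted rather than read one by one. The following is a minimal sketch, not part of the product tooling: the function name summarise_chronyd_events is hypothetical, and the log path in the usage comment is an assumption that may differ on your stemcell.

```shell
# Summarise chronyd events from log text read on stdin.
# Hypothetical helper for illustration only.
summarise_chronyd_events() {
  awk '
    # "Can.t" matches the apostrophe without breaking the shell quoting.
    /chronyd\[[0-9]+\]: Can.t synchronise: no majority/ { no_majority++ }
    /chronyd\[[0-9]+\]: Selected source /               { selected++ }
    END {
      # Unset awk variables print as 0 with %d.
      printf "no_majority=%d selected_source=%d\n", no_majority, selected
    }
  '
}

# Example usage on a node (log path is an assumption; adjust as needed):
#   grep -h chronyd /var/log/syslog | summarise_chronyd_events
```

A high no_majority count relative to the observation window suggests the disagreement is recurring rather than a one-off event.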

Environment

TKGi 1.23

Opsman 3.x

Cause

NTP server connectivity problems and time discrepancies between the NTP servers.

Resolution

Messages of the type "Can't synchronise: no majority" usually indicate disagreement between the time sources:

This is a protective mechanism in NTP. Chrony doesn't just trust the first server it sees; it compares all available sources to find a "consensus" (the true time).

What happened: Chrony looked at the configured NTP servers, and their reported times were too far apart from each other.

The "No Majority" Rule: If you have 3 servers and they all report significantly different times, Chrony cannot determine which one is "lying" and which one is correct. To avoid setting the system clock to an incorrect time, it refuses to synchronize at all.

Common Cause: This often happens if only 2 NTP servers are configured and they disagree, or if network jitter makes the responses inconsistent.

Additional Information

Steps to determine the true cause:

Verify the time across the cluster. The result below shows no deviation; however, at one-second resolution this check cannot detect a drift of milliseconds:

date +%F_%T%z ; bosh -d service-instance_<ID> ssh --command="date +%F_%T%z" | sort -s -k1,1 | grep -vE 'Unauthorized use is strictly prohibited|is subject to logging and monitoring|Connection to.*closed' ; date +%F_%T%z
2026-02-25_14:49:45+0000
master/: stdout | 2026-02-25_14:49:48+0000
master/: stdout | 2026-02-25_14:49:48+0000
master/: stdout | 2026-02-25_14:49:48+0000
worker/: stdout | 2026-02-25_14:49:48+0000
worker/: stdout | 2026-02-25_14:49:48+0000
worker/: stdout | 2026-02-25_14:49:48+0000
2026-02-25_14:49:49+0000

Verify the status of the NTP servers and whether they are reachable from the workers:

chronyc tracking 

Reference ID    : 0A22FE81 (SERVER)
Stratum         : 3
Ref time (UTC)  : Wed Feb 25 14:37:29 2026
System time     : 0.000008814 seconds fast of NTP time
Last offset     : +0.000035635 seconds
RMS offset      : 0.041248817 seconds
Frequency       : 18.524 ppm slow
Residual freq   : +0.001 ppm
Skew            : 0.041 ppm
Root delay      : 0.001787941 seconds
Root dispersion : 0.002168136 seconds
Update interval : 1032.2 seconds
Leap status     : Normal
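In the sample above the current offset is only a few microseconds, while the RMS offset (the long-term average of the offset) is about 41 ms, which hints at larger excursions earlier on. As a rough sketch, the RMS value can be pulled out in milliseconds for comparison across nodes. The function name rms_offset_ms is hypothetical and the parsing assumes the output layout shown above.

```shell
# Print the "RMS offset" from `chronyc tracking` output (stdin) in milliseconds.
# Hypothetical helper; assumes the "Label : value unit" layout shown above.
rms_offset_ms() {
  awk -F': *' '/^RMS offset/ {
    split($2, parts, " ")             # parts[1] is the value in seconds
    printf "%.3f\n", parts[1] * 1000  # convert seconds to milliseconds
  }'
}

# Example usage on a node:
#   chronyc tracking | rms_offset_ms
```

For the sample output above this prints 41.249.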

The output of chronyc sources shows the problem: two servers are present, but replies are missing, either because a server is not responding or because network packets are being dropped:

chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* <SERVER1>                   2   9    75   979    +28us[  +63us] +/- 3003us
^- <SERVER2>                   2  10   325   790    +32us[  +32us] +/- 2810us

=====================

Another sample of the above command:

chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^- <SERVER1>                   2  10   205   25m   -336us[ -336us] +/- 2555us
^* <SERVER2>                   2  10   321   28m   -466us[ -466us] +/- 2485us

The samples indicate:

The Reach value is an 8-bit register, printed in octal, that records the outcome of the last 8 polls of the source (a 1 bit for each valid reply received).

A perfect score is 377 (binary 11111111), meaning the last 8 polls were all successful.

Scores of 205 (binary 10000101) and 321 (binary 11010001) indicate that several recent polls were dropped or lost. This explains the "no majority" log: when packets drop, Chrony loses the data points it needs to stay confident.
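The octal-to-binary reading of Reach can be sketched in a few lines of shell. This is an illustrative helper only: decode_reach is a hypothetical name and the function is not part of chrony or the bosh CLI.

```shell
# Decode a chronyc "Reach" value (octal) into its 8-bit history and
# count how many of the last 8 polls got a valid reply.
# Hypothetical helper; variables are global to stay POSIX sh compatible.
decode_reach() {
  reach_octal="$1"
  value=$((0$reach_octal))   # a leading 0 makes the shell read it as octal
  bits=""
  ok=0
  i=7
  while [ "$i" -ge 0 ]; do
    if [ $(( ($value >> $i) & 1 )) -eq 1 ]; then
      bits="${bits}1"
      ok=$((ok + 1))
    else
      bits="${bits}0"
    fi
    i=$((i - 1))
  done
  echo "$reach_octal -> $bits ($ok/8 polls answered)"
}

for r in 377 205 321; do
  decode_reach "$r"
done
# 377 -> 11111111 (8/8 polls answered)
# 205 -> 10000101 (3/8 polls answered)
# 321 -> 11010001 (4/8 polls answered)
```

The rightmost bit is the most recent poll, so 205 and 321 both show that the latest poll succeeded but several earlier ones did not.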

The second example also shows some latency:

25m and 28m: It has been nearly half an hour since the node last received a valid time sample from these sources.

Ideally, with a Poll interval of 10 (which is 2^10 = 1024 seconds, or about 17 minutes), the last sample should be more recent than 28 minutes. This reinforces that the network connection to these specific IPs is inconsistent.
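The comparison between Poll and LastRx is simple arithmetic and can be sketched as below. The helper names are hypothetical, and treating "older than one poll interval" as stale is a simplifying assumption for illustration rather than a rule chrony itself applies.

```shell
# Convert a chronyc "Poll" value (log2 of the interval) to seconds.
poll_to_seconds() {
  echo $((1 << $1))
}

# Convert a chronyc "LastRx" value such as 979, 25m or 2h to seconds.
# Plain numbers are already seconds; m, h and d suffixes are scaled.
lastrx_to_seconds() {
  case "$1" in
    *m) echo $(( ${1%m} * 60 )) ;;
    *h) echo $(( ${1%h} * 3600 )) ;;
    *d) echo $(( ${1%d} * 86400 )) ;;
    *)  echo "$1" ;;
  esac
}

# Report whether the last sample is older than one poll interval
# (assumed staleness heuristic, see the note above).
check_lastrx() {
  poll_s=$(poll_to_seconds "$1")
  last_s=$(lastrx_to_seconds "$2")
  if [ "$last_s" -gt "$poll_s" ]; then
    echo "stale: last sample ${last_s}s ago, poll interval ${poll_s}s"
  else
    echo "ok: last sample ${last_s}s ago, poll interval ${poll_s}s"
  fi
}

check_lastrx 10 28m   # stale: last sample 1680s ago, poll interval 1024s
check_lastrx 10 790   # ok: last sample 790s ago, poll interval 1024s
```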

Finally, if needed, a packet capture can be taken on the cluster to validate the NTP behaviour and confirm that the time sync completes:

bosh -d service-instance_<ID> pcap worker --interface eth0 --filter "udp port 123" --output ./ntp_capture.pcap

Because a sync is completed only every 17 minutes or so, trigger a burst of measurements while the packet capture is running:

bosh -d service-instance_<ID> ssh -c "sudo chronyc burst 4/4 ;sleep 1; sudo chronyc sources" | grep stdout
worker/: stdout | 200 OK
master/: stdout | 200 OK
worker/: stdout | 200 OK
worker/: stdout | MS Name/IP address         Stratum Poll Reach LastRx Last sample
worker/: stdout | ^* ntp.example.com             2  10   377     1  -4903us[-8373us] +/-  144ms
worker/: stdout | ^* ftc-ntp-1.example.com       2  10   377     1   +210us[ +373us] +/-   28ms
master/: stdout | ^* acc-ntp-1.example.com       2   9   377     1   +469us[ +951us] +/-   36ms