System clock synchronization failed on Diego Cell VMs

Article ID: 419396


Products

VMware Tanzu Platform - Cloud Foundry

Issue/Introduction

In a TPCF foundation running on Azure IaaS, some application instances were observed to experience time drift.

The output of the timedatectl command on some diego_cell VMs showed "System clock synchronized: no".

$ timedatectl
               Local time: Thu 2025-11-06 18:19:43 UTC
           Universal time: Thu 2025-11-06 18:19:43 UTC
                 RTC time: Thu 2025-11-06 18:19:42
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no
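This check can be scripted when many cells need to be inspected. The sketch below is illustrative (the helper name check_sync is not part of the product); it parses timedatectl output from stdin so it also works against captured text. On newer systemd versions, `timedatectl show -p NTPSynchronized --value` prints the same yes/no answer directly.

```shell
# Hedged sketch: report whether the system clock is NTP-synchronized.
# check_sync is an illustrative helper that reads `timedatectl` output
# from stdin, so it can be run on a live VM or on captured output.
check_sync() {
  if grep -q 'System clock synchronized: yes'; then
    echo "synchronized"
  else
    echo "NOT synchronized"
  fi
}

# On a live VM:
#   timedatectl | check_sync
#   timedatectl show -p NTPSynchronized --value   # prints yes or no
```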

 

The output of the "chronyc tracking" command showed the following: a Reference ID of 00000000, a Ref time at the start of the Unix epoch, and a Leap status of 'Not synchronised'. These indicate that chrony was not able to synchronize with any time source.

$ chronyc tracking
Reference ID    : 00000000 ()
Stratum         : 0
Ref time (UTC)  : Thu Jan 01 00:00:00 1970
System time     : 0.000000062 seconds slow of NTP time
Last offset     : +0.000211340 seconds
RMS offset      : 0.000577809 seconds
Frequency       : 30.576 ppm slow
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 1.000000000 seconds
Root dispersion : 1.000000000 seconds
Update interval : 8.0 seconds
Leap status     : Not synchronised
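The same unsynchronized state can be detected programmatically, for example in a monitoring check. This is a hedged sketch (the helper name chrony_synced is illustrative); it reads "chronyc tracking" output from stdin so it can also be tested against captured text.

```shell
# Hedged sketch: succeed only when chrony reports it is synchronised.
# Reads `chronyc tracking` output from stdin; on a live VM:
#   chronyc tracking | chrony_synced && echo OK
chrony_synced() {
  # "Leap status : Normal" means synchronised; anything else,
  # such as "Not synchronised", is treated as a failure.
  grep -q '^Leap status.*: Normal'
}
```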

 

The output of "chronyc -n sources -v" showed the following. It listed PHC0 (the Precision Time Protocol hardware clock exposed by the Azure hypervisor) and the single NTP server configured in the BOSH Tile settings. Both had 'x' in the second column (Source state), meaning the two sources disagreed on time by enough for chrony to mark both as 'may be in error'; as a result, the VM's system clock was not synchronized.

$ chronyc -n sources -v

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current best, '+' = combined, '-' = not combined,
| /             'x' = may be in error, '~' = too variable, '?' = unusable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#x PHC0                          0   3   377     5  -6596ms[-6596ms] +/-   12us
^x 10.xxx.yy.zzz                 1  10   377   119  -6603ms[-6603ms] +/- 4079us

 

The chrony logs showed the error "Can't synchronise: no majority".

Nov 14 15:50:25 xyz123 chronyd[235887]: Selected source PHC0
Nov 14 15:50:33 xyz123 chronyd[235887]: Can't synchronise: no majority
Nov 14 15:50:41 xyz123 chronyd[235887]: Selected source PHC0
Nov 14 15:50:57 xyz123 chronyd[235887]: Can't synchronise: no majority
Nov 14 15:51:50 xyz123 chronyd[235887]: Selected source PHC0
Nov 14 15:53:45 xyz123 chronyd[235887]: Can't synchronise: no majority
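To gauge how often source selection is failing, the log can be grepped for this message. A hedged sketch follows: the helper name count_no_majority is illustrative, and it reads log lines from stdin; the journalctl unit name and time window shown in the comment are assumptions that may differ per stemcell.

```shell
# Hedged sketch: count recent "no majority" failures in chronyd logs.
# Illustrative usage on a live VM (unit name is an assumption):
#   sudo journalctl -u chronyd --since "1 hour ago" | count_no_majority
count_no_majority() {
  # grep -c prints the number of matching lines; `|| true` keeps the
  # exit status 0 even when the count is zero.
  grep -c "Can't synchronise: no majority" || true
}
```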

 

Environment

VMware Tanzu Platform - Cloud Foundry

Azure

Cause

The two time sources, PHC0 and the NTP server, differed by enough for chrony to mark both as 'may be in error'. One of the sources likely had an incorrect time, but with only two sources disagreeing, chrony cannot determine which one is correct: even if one of them has the right time, there is no majority of agreeing sources, so chrony fails with "Can't synchronise: no majority".

 

Resolution

There are a few options to try to resolve this issue:

  • Recreate the diego_cell VM; it may land on another Azure host that does not have the hardware clock issue (assuming the Azure host is the source of the problem).
  • PHC0 would have to be checked on the Azure host, which requires reaching out to Azure support.
  • Reach out to the team that manages the NTP server to check it as well, if needed. If other VMs using the same NTP server do not have this issue, the problem is more likely with PHC0 on that particular VM.
  • Add another NTP server to get around the majority issue. If the two NTP servers agree on time, they form a majority (2 NTP servers vs. 1 PHC0) that chrony can use to synchronize the system clock.
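In TPCF the NTP server list comes from the BOSH Director tile settings rather than a hand-edited file, but for the last option the rendered chrony configuration would need to end up with enough sources to form a majority, along the lines of this illustrative fragment (the hostnames are placeholders; the PHC refclock line follows the form commonly documented for Azure Linux VMs and is shown only for context):

```
# Illustrative chrony.conf fragment (placeholder hostnames).
# With multiple NTP servers plus PHC0, one bad source can be outvoted.
server ntp1.example.com iburst
server ntp2.example.com iburst
server ntp3.example.com iburst
# PHC refclock as typically configured on Azure Linux VMs:
refclock PHC /dev/ptp_hyperv poll 3 dpoll -2 offset 0
```

For the first option, a cell can be recreated with the BOSH CLI, e.g. `bosh -d <deployment> recreate diego_cell/<instance-id>` (the deployment name and instance ID are placeholders for your environment).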