Time sync issues on TKGm clusters running on VMC on AWS


Article ID: 399391


Products

VMware Cloud on AWS
VMware Tanzu Kubernetes Grid
VMware Tanzu Kubernetes Grid Management

Issue/Introduction

After SSHing to the control plane node(s) of a TKGm cluster running on VMC on AWS, you notice that the 'date' command reports the time in UTC but with a drift. In this scenario the drift was between 5 and 10 minutes.

You also notice that, when running kubectl queries or upgrades from your jumpbox against TKGm clusters on VMC on AWS, the AGE column shows <invalid> instead of the expected age.

Environment

TKGm 2.5.2

VMC on AWS

Cause

The chronyd service's /etc/chrony/chrony.conf file has been configured with pool.ntp.org, which is incorrect for VMC on AWS.
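For reference, the offending configuration can be spotted directly on a node; the exact directive may vary, but it points at the public pool rather than the Amazon Time Sync Service (output below is illustrative):

$ grep -E '^(pool|server)' /etc/chrony/chrony.conf
pool pool.ntp.org iburst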

When viewing the journalctl logs for chronyd, you see messages like the following:

May 30 15:11:04 #####-control-plane-##### chronyd[583]: Selected source 212.71.253.212 (pool.ntp.org)
May 30 15:12:10 #####-control-plane-##### chronyd[583]: Source 217.154.60.177 replaced with 109.74.206.120 (pool.ntp.org)
May 30 15:15:22 #####-control-plane-##### chronyd[583]: Can't synchronise: no majority
May 30 15:21:15 #####-control-plane-##### chronyd[583]: Selected source 109.74.206.120 (pool.ntp.org)

When SSH'd to the control plane (after installing netcat, for instance), nc -vzu <pool.ntp.org_IP> 123 will time out, whereas the same check against the AWS NTP IP 169.254.169.123 will not.
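For example (substitute the pool IP your node actually resolved):

$ nc -vzu <pool.ntp.org_IP> 123    # times out from VMC on AWS
$ nc -vzu 169.254.169.123 123      # succeeds against the Amazon Time Sync Service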

Note: the content below is from KB 329764.

Prerequisite: Add a firewall rule in your Compute Gateway which allows NTP traffic to 169.254.169.123.
Sample rule:
Source: Compute Workload VM/Segment
Destination: 169.254.169.123
Services: NTP (UDP:123)
Applied To: Internet Interface or Direct Connect Interface (i.e. the interface where the default route is pointing; if it is not advertised over a Direct Connect, it will be the Internet Interface). In this example, we do not have a DX connection to the SDDC, so the rule is applied to the Internet Interface.
Note: If you have a default route advertised over a VPN, then you will not be able to use the native Amazon Time Sync Service.


Resolution

Ensure that a firewall rule exists in your Compute Gateway which allows NTP traffic to 169.254.169.123.

Follow the procedure below to correct the time sync on the problem cluster and its nodes.

For legacy management and workload clusters, using a ytt overlay:

  1. Create the ytt overlay; replace 169.254.169.123 with your NTP server hostname/IP
    $ cat > ~/.tanzu/tkg/providers/ytt/03_customizations/add_ntp.yaml <<EOF
    #@ load("@ytt:overlay", "overlay")
    #@ load("@ytt:data", "data")
    
    #@overlay/match by=overlay.subset({"kind":"KubeadmControlPlane"})
    ---
    spec:
      kubeadmConfigSpec:
        #@overlay/match missing_ok=True
        ntp:
          enabled: true
          servers:
          - 169.254.169.123
    
    #@overlay/match by=overlay.subset({"kind":"KubeadmConfigTemplate"}),expects="1+"
    ---
    spec:
      template:
        spec:
          #@overlay/match missing_ok=True
          ntp:
            enabled: true
            servers:
            - 169.254.169.123
    EOF
  2. Dry-run the cluster creation and verify your NTP server is configured.
    
    $ tanzu cluster create dryrun-cluster --dry-run --file cluster-config.yaml > dryrun-cluster.yaml
    
    $ cat dryrun-cluster.yaml | yq e 'select(.kind == "KubeadmControlPlane") | .spec.kubeadmConfigSpec.ntp' -
    enabled: true
    servers:
      - 169.254.169.123
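    Similarly, you can verify the worker node bootstrap templates patched by the second overlay stanza; a quick check against the same dry-run output:
    
    $ yq e 'select(.kind == "KubeadmConfigTemplate") | .spec.template.spec.ntp' dryrun-cluster.yaml
    enabled: true
    servers:
      - 169.254.169.123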
  3. Create the cluster
  4. Once the cluster is created, there are more data points to verify your NTP server has been applied.
    • The NTP configuration is stored in the cloud-init script which is used to provision the IaaS VM. The TKG management cluster stores the cloud-init script as a Secret named after the corresponding Machine.
      
      $ kubectl get secrets mgmt-control-plane-98zlp -o json | jq '.data.value' -r | base64 -d | grep ntp -A 4
      ntp:
        enabled: true
        servers:
          - 169.254.169.123
    • The NTP configuration is picked up by the chrony service in default Ubuntu VMs.
      
      $ cat /etc/chrony/chrony.conf | grep server
      # Use servers from the NTP Pool Project. Approved by Ubuntu Technical Board
      # servers
      server 169.254.169.123 iburst
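    • Optionally, you can confirm that chronyd has actually selected the new source (assuming the chronyc client is present on the node, as it is on default Ubuntu images):
      
      $ chronyc sources     # the selected source, marked ^*, should be 169.254.169.123
      $ chronyc tracking    # reports the current offset from the reference source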



For management and workload ClusterClass clusters:

1. When creating the cluster, add your NTP server to the cluster's Configuration File Variables (https://techdocs.broadcom.com/us/en/vmware-tanzu/standalone-components/tanzu-kubernetes-grid/2-5/tkg/config-ref.html#vsphere):

NTP_SERVERS: "169.254.169.123"


2. After the cluster is created, the generated class-based object structure will include the following:

kind: Cluster
spec:
  topology:
    variables:
    - name: ntpServers
      value:
      - "169.254.169.123"



3. On the nodes of the cluster, the NTP server setting appears in the chrony service configuration like the following:

$ cat /etc/chrony/chrony.conf | grep server
# Use servers from the NTP Pool Project. Approved by Ubuntu Technical Board
# servers
server 169.254.169.123 iburst



Workaround:
If the TKG clusters have already been created and you want to modify the NTP parameters without downtime, you may edit /etc/chrony/chrony.conf with the NTP server and restart chronyd.service. Be aware that this workaround is not persistent if the VM gets recreated.

$ vim /etc/chrony/chrony.conf
# Use servers from the NTP Pool Project. Approved by Ubuntu Technical Board
# servers
server 169.254.169.123 iburst

$ systemctl restart chronyd.service
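
If the clock is already minutes out, chronyd will by default slew it back gradually. To step the clock immediately instead, you can optionally run:

$ chronyc makestep    # force an immediate clock correction rather than a gradual slew
$ chronyc tracking    # verify the system time offset is now close to zero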