The Aria Automation UI becomes inaccessible after restarting the environment from VMware Aria Suite Lifecycle Manager, as the ccs-k3s-app-<ID> pod enters a CrashLoopBackOff state.

Article ID: 440167

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • Accessing the Aria Automation UI shows "This page can't be found".
  • Running kubectl get pods -n prelude reveals that the 'ccs-k3s-app-<ID>' pod is stuck in the "CrashLoopBackOff" state:
    NAME                            READY   STATUS             RESTARTS          AGE
    ccs-k3s-app-aaaaaaaaaa-bbbbb    1/2     CrashLoopBackOff   186 (2m26s ago)   13h
    ccs-k3s-app-ccccccccccc-ddddd   2/2     Running            153 (155m ago)    13h
    ccs-k3s-app-eeeeeeeeee-fffff    1/2     CrashLoopBackOff   187 (3m24s ago)   13h
  • The /var/log/deploy.log shows the orchestration failing during the prelude backend service phase: RuntimeError: retry timeout exceeded for helm_wait_check.
    [####-##-## 12:45:13] ERROR Release 'ccs-k3s' in namespace 'prelude' failed to come up
    Traceback (most recent call last):
      File "/opt/scripts/helm-upstall", line 286, in main
        helm_wait(namespace, release_name, timeout=timeout)
      File "/opt/scripts/helm-upstall", line 127, in helm_wait
        helm_wait_check()
      File "/opt/python-modules/vracli/decorators.py", line 161, in wrapper
        raise RuntimeError(f'retry timeout exceeded for {f_name}') from err
    RuntimeError: retry timeout exceeded for helm_wait_check 

     

  • The /services-logs/prelude/ccs-k3s-app/files-logs/k3s-server-app.log file shows a TLS handshake error:
    time="####-##-##T15:31:47Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="tls: handshake message of length 65680 bytes exceeds maximum of 65536 bytes"
  • Diagnostic traces from the ccs-k3s-app pod identify the secret kube-system/k3s-serving as containing a bloated SAN count, which produces an excessively large certificate chain (1 certificate per SAN).
    time="####-##-##T15:31:47Z" level=info msg="Active TLS secret kube-system/k3s-serving (ver=69513035) (count 2731): map[listener.cattle.io/cn-##.###.#.##:##.###.#.## listener.cattle.io/cn-##.###.#.##:##.###.#.##
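To see how far a failing handshake exceeded the limit, the two byte counts can be pulled out of the error line shown above. The following is a minimal sketch, not a supported tool: the function name handshake_overage is hypothetical, and the pattern assumes the log line has exactly the wording of the sample above.

```shell
# Hypothetical helper: for each oversized-handshake error in a log file,
# print the reported size, the limit, and the difference between them.
# Assumes the error wording matches the sample shown in this article.
handshake_overage() {
    log_file="$1"
    # Extract "<actual> <limit>" from every matching line.
    sed -n 's/.*handshake message of length \([0-9][0-9]*\) bytes exceeds maximum of \([0-9][0-9]*\) bytes.*/\1 \2/p' "$log_file" |
    while read -r actual limit; do
        echo "handshake=${actual} limit=${limit} over_by=$((actual - limit))"
    done
}
```

Example: handshake_overage /services-logs/prelude/ccs-k3s-app/files-logs/k3s-server-app.log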

Environment

  • Aria Automation 8.18.x
  • VCF Automation 9.x

Cause

  • The 'ccs-k3s' service generates two new Subject Alternative Name (SAN) entries (Pod IP and Pod Name) for its internal TLS certificate during every pod startup event.
  • In environments with frequent pod restarts, these SAN entries continuously accumulate within the 'k3s-serving' secret.
  • Over time, the TLS handshake message carrying the certificate chain exceeds the 64 KB (65,536 bytes) limit hardcoded in Go's TLS library, resulting in TLS handshake failures and preventing the 'ccs-k3s' service from starting successfully.
  • Since the Aria Automation UI service deployment depends on successful startup of all Prelude backend services, the UI deployment does not proceed while the 'ccs-k3s' service remains unavailable. Consequently, the Aria Automation UI remains inaccessible.

Resolution

  • The reported issue is currently being addressed in the following upcoming releases:
    • Aria Automation 8.18.1 P6
    • VCF Automation 9.1.1.0
    • VCF Automation 9.2
  • ETA is currently not available.
    • Note: Monitor the upcoming product announcements for updates on availability.

Workaround:

Step 1: Verify Existing Prelude Pods

  • Run the following command to list all pods in the prelude namespace: 'kubectl -n prelude get pods | grep ccs-k3s'
    NAME                            READY   STATUS             RESTARTS          AGE
    ccs-k3s-app-aaaaaaaaaa-bbbbb    1/2     CrashLoopBackOff   186 (2m26s ago)   13h
    ccs-k3s-app-ccccccccccc-ddddd   2/2     Running            153 (155m ago)    13h
    ccs-k3s-app-eeeeeeeeee-fffff    1/2     CrashLoopBackOff   187 (3m24s ago)   13h

Capture the names of all running ccs-k3s pods.
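If the pod names need to be captured for the later steps, the listing can be reduced to names and statuses. This is a sketch under assumptions: the function names are hypothetical, and the column positions (NAME first, STATUS third) are taken from the sample output above.

```shell
# Print "<pod-name> <status>" for every ccs-k3s pod in the prelude namespace.
list_ccs_k3s_pods() {
    # --no-headers drops the column header so only pod rows reach awk.
    kubectl -n prelude get pods --no-headers |
        awk '$1 ~ /^ccs-k3s/ { print $1, $3 }'
}

# Print only the names of ccs-k3s pods whose status is not Running.
list_unhealthy_ccs_k3s_pods() {
    list_ccs_k3s_pods | awk '$2 != "Running" { print $1 }'
}
```

Running list_ccs_k3s_pods > /tmp/ccs-k3s-pods.txt keeps a record of the original pod names; the file path is only an example.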

Step 2: Obtain Cluster IP of ccs-k3s Service

  • Run the following command: 'kubectl -n prelude get service ccs-k3s'
  • Capture the Cluster IP from the output.
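As an alternative to reading the value from the table output, the Cluster IP can be requested directly with a JSONPath expression. The sketch below assumes the service is a regular ClusterIP service; the function name get_ccs_k3s_cluster_ip is hypothetical.

```shell
# Print the Cluster IP of the ccs-k3s service, or fail if none is usable.
get_ccs_k3s_cluster_ip() {
    ip="$(kubectl -n prelude get service ccs-k3s -o jsonpath='{.spec.clusterIP}')" || return 1
    # An empty value or "None" (headless service) cannot be used in Step 3.
    if [ -z "$ip" ] || [ "$ip" = "None" ]; then
        echo "no usable Cluster IP for service ccs-k3s" >&2
        return 1
    fi
    echo "$ip"
}
```

Example: CLUSTER_IP="$(get_ccs_k3s_cluster_ip)" stores the address for use in the next step.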

Step 3: Validate Existing Certificate

  • Run the following command using the Cluster IP collected in the previous step: 'echo | openssl s_client -showcerts -connect <Cluster_IP>:6443 2>/dev/null | openssl x509 -text | grep DNS'
    #Response
    
    s_client: cannot provide both -connect option and target parameter
    s_client: Use -help for summary.
    Could not read certificate from <stdin>
    Unable to load certificate
  • The command returns: Unable to load certificate

Step 4: Access Running ccs-k3s Pod

  • Use one of the pod names collected in Step 1 and execute: 'kubectl -n prelude exec -it <ccs-k3s-pod-id> -- bash'
    #Response:
    Defaulted container "ccs-k3s-app" out of: ccs-k3s-app, nginx-proxy, ccs-k3s-dependencies (init)

Step 5: Verify Existing k3s Serving Secret

  • Inside the pod, run: 'kubectl -n kube-system get secret k3s-serving'
    [ / ]# kubectl -n kube-system get secret k3s-serving
    NAME          TYPE                DATA   AGE
    k3s-serving   kubernetes.io/tls   2      2y2d
  • To inspect the secret contents, run: 'kubectl -n kube-system get secret k3s-serving -o yaml'
    #Sample Output
    listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-ssssss: ccs-k3s-app-xxxxxxxx-ssssss
    listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-tttttt: ccs-k3s-app-xxxxxxxx-tttttt
    listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-uuuuuu: ccs-k3s-app-xxxxxxxx-uuuuuu
    listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-vvvvvv: ccs-k3s-app-xxxxxxxx-vvvvvv
    listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-wwwwww: ccs-k3s-app-xxxxxxxx-wwwwww
    listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-xxxxxx: ccs-k3s-app-xxxxxxxx-xxxxxx
    listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-yyyyyy: ccs-k3s-app-xxxxxxxx-yyyyyy
    listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-zzzzzz: ccs-k3s-app-xxxxxxxx-zzzzzz
    listener.cattle.io/cn-kubernetes: kubernetes
    listener.cattle.io/cn-kubernetes.default: kubernetes.default
    listener.cattle.io/cn-kubernetes.default.svc: kubernetes.default.svc
    kubernetes.default.svc.cluster.local
    
    listener.cattle.io/fingerprint: SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    creationTimestamp: "2024-05-10T22:43:33Z"
    name: k3s-serving
    namespace: kube-system
    resourceVersion: "123456789"
    uid: 12345678-1234-1234-1234-123456789abc
    type: kubernetes.io/tls
    • Note: The secret contains a very large number of certificates.
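Because the YAML output can run to thousands of lines, counting the 'listener.cattle.io/cn-' entries gives a quicker picture than reading them. The following is a minimal sketch: the function name count_k3s_serving_sans is hypothetical, and it assumes each entry appears on its own line as in the sample output above. Like the commands in this step, it is meant to be run inside the ccs-k3s pod.

```shell
# Count the per-SAN entries recorded on the k3s-serving secret.
count_k3s_serving_sans() {
    # Match only lines that begin (after indentation) with the cn- prefix,
    # so the fingerprint entry and unrelated fields are not counted.
    kubectl -n kube-system get secret k3s-serving -o yaml |
        grep -c '^[[:space:]]*listener\.cattle\.io/cn-'
}
```

Recording this number before deleting the secret makes the comparison in Step 10 straightforward. For reference, the diagnostic trace in the Issue section reports a count of 2731 for an affected environment.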

Step 6: Delete Existing k3s Serving Secret

  • Run the following command: 'kubectl -n kube-system delete secret k3s-serving'
  • This forces regeneration of the serving certificates.

Step 7: Verify Pod Restart

  • Check the pod status: 'kubectl -n prelude get pods | grep ccs-k3s'
    NAME                      READY   STATUS    RESTARTS          AGE
    ccs-k3s-app-11111111111   2/2     Running   187 (6m16s ago)   14h
    ccs-k3s-app-22222222222   2/2     Running   153 (159m ago)    14h
    ccs-k3s-app-33333333333   1/2     Running   189 (14s ago)     14h
  • A pod restarts automatically after the secret is deleted and appears with a recent restart time (14s ago in the sample output).

Step 8: Delete Remaining Old Pods

  • Delete the remaining old pods noted in Step 1: 'kubectl -n prelude delete pod <pod_name>'
    # kubectl -n prelude delete pod ccs-k3s-app-11111111111
    pod "ccs-k3s-app-11111111111" deleted
    
    # kubectl -n prelude delete pod ccs-k3s-app-22222222222
    pod "ccs-k3s-app-22222222222" deleted

Step 9: Delete the Pod That Restarted First in Step 7

  • Run the following command: 'kubectl -n prelude delete pod <pod_name>'
    # kubectl -n prelude delete pod ccs-k3s-app-333333333333
    pod "ccs-k3s-app-333333333333" deleted
  • Wait for all pods to return to the Running state.
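Rather than re-running the pod listing by hand, the wait can be scripted as a simple poll. This is a sketch under assumptions: the function names are hypothetical, the defaults of 30 attempts and a 10-second interval are arbitrary, and "ready" is taken to mean that the READY column shows equal numbers (for example 2/2) with a Running status.

```shell
# Return 0 only when at least one ccs-k3s pod exists and every ccs-k3s pod
# is Running with all of its containers ready.
ccs_k3s_pods_ready() {
    kubectl -n prelude get pods --no-headers |
        awk '$1 ~ /^ccs-k3s/ {
                 seen = 1
                 split($2, ready, "/")
                 if (ready[1] != ready[2] || $3 != "Running") bad = 1
             }
             END { if (seen && !bad) exit 0; exit 1 }'
}

# Poll until the pods are ready.
# $1 = maximum attempts (default 30), $2 = seconds between attempts (default 10).
wait_for_ccs_k3s_pods() {
    attempts="${1:-30}"
    interval="${2:-10}"
    while [ "$attempts" -gt 0 ]; do
        ccs_k3s_pods_ready && return 0
        attempts=$((attempts - 1))
        sleep "$interval"
    done
    return 1
}
```

Example: wait_for_ccs_k3s_pods 60 5 || echo "ccs-k3s pods are still not ready"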

Step 10: Verify Regenerated Certificates

  • Run the following command again: 'kubectl -n kube-system get secret k3s-serving -o yaml'
  • Verify that the number of certificates is now significantly reduced and appears normal.

Step 11: Power On the Environment from Aria Suite Lifecycle Manager

  • Log in to VMware Aria Suite Lifecycle Manager (LCM).
  • Perform the Power On action for the environment.

Step 12: Monitor Prelude Deployment

  • SSH into the Aria Automation node and monitor the deployment log:
    tail -f /var/log/deploy.log
  • Observe the Prelude deployment progress and wait for successful completion.
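Since deploy.log also holds entries from earlier failed attempts, it can help to look at the most recent error line and compare its timestamp with the current run. The sketch below assumes error lines contain the marker " ERROR " as in the sample from the Issue section; the function name last_deploy_error is hypothetical.

```shell
# Print the newest ERROR line in the deploy log, prefixed with its line number.
# $1 = log file path (defaults to /var/log/deploy.log).
last_deploy_error() {
    log_file="${1:-/var/log/deploy.log}"
    # grep -n prefixes each match with its line number; tail keeps the last one.
    grep -n ' ERROR ' "$log_file" | tail -n 1
}
```

If the newest error predates the Power On action from Step 11, it belongs to an earlier attempt rather than to the current deployment.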

Step 13: Verify Aria Automation UI is accessible