The Aria Automation UI becomes inaccessible after restarting the environment from VMware Aria Suite Lifecycle Manager, as the ccs-k3s-app-<ID> pod enters a CrashLoopBackOff state.

Article ID: 440167

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Accessing the Aria Automation UI shows "This page can't be found".

Running 'kubectl get pods -n prelude' shows that one or more 'ccs-k3s-app-<ID>' pods are stuck in the "CrashLoopBackOff" state:

NAME	READY	STATUS	RESTARTS	AGE
ccs-k3s-app-aaaaaaaaaa-bbbbb	1/2	CrashLoopBackOff	186 (2m26s ago)	13h
ccs-k3s-app-ccccccccccc-ddddd	2/2	Running	153 (155m ago)	13h
ccs-k3s-app-eeeeeeeeee-fffff	1/2	CrashLoopBackOff	187 (3m24s ago)	13h

The /var/log/deploy.log file shows the deployment failing during the Prelude backend services phase with RuntimeError: retry timeout exceeded for helm_wait_check.

[####-##-## 12:45:13] ERROR Release 'ccs-k3s' in namespace 'prelude' failed to come up
Traceback (most recent call last):
  File "/opt/scripts/helm-upstall", line 286, in main
    helm_wait(namespace, release_name, timeout=timeout)
  File "/opt/scripts/helm-upstall", line 127, in helm_wait
    helm_wait_check()
  File "/opt/python-modules/vracli/decorators.py", line 161, in wrapper
    raise RuntimeError(f'retry timeout exceeded for {f_name}') from err
RuntimeError: retry timeout exceeded for helm_wait_check

The /services-logs/prelude/ccs-k3s-app/files-logs/k3s-server-app.log file shows a TLS handshake error:

time="####-##-##T15:31:47Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="tls: handshake message of length 65680 bytes exceeds maximum of 65536 bytes"
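
The size reported in this error can be extracted from the log and compared against the limit with a small shell helper. The following is a minimal sketch: the function names handshake_sizes and has_oversized_handshake are illustrative and are not part of the product, and the 65536-byte limit is taken from the error message above.

```shell
# Limit quoted in the error message above (65536 bytes).
MAX_HANDSHAKE_BYTES=65536

# Read log lines on stdin and print the size of every handshake message
# that the log reports as too large. (Hypothetical helper name.)
handshake_sizes() {
  sed -n 's/.*handshake message of length \([0-9][0-9]*\) bytes exceeds maximum.*/\1/p'
}

# Read sizes on stdin; return 0 if at least one exceeds the limit, 1 otherwise.
# (Hypothetical helper name.)
has_oversized_handshake() {
  while read -r size; do
    if [ "$size" -gt "$MAX_HANDSHAKE_BYTES" ]; then
      return 0
    fi
  done
  return 1
}

# Example usage on the appliance (not executed here):
#   handshake_sizes < /services-logs/prelude/ccs-k3s-app/files-logs/k3s-server-app.log \
#     | has_oversized_handshake && echo "oversized TLS handshake found"
```

If the check succeeds, the log contains the oversized-handshake symptom described in this article.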

Diagnostic traces from the ccs-k3s-app pod identify the secret kube-system/k3s-serving as containing the bloated SAN count (2731 in the example below). This produces an excessively large certificate chain (1 certificate per SAN).
```
time="####-##-##T15:31:47Z" level=info msg="Active TLS secret kube-system/k3s-serving (ver=69513035) (count 2731): map[listener.cattle.io/cn-##.###.#.##:##.###.#.## listener.cattle.io/cn-##.###.#.##:##.###.#.##
```

Environment

Aria Automation 8.18.x
VCF Automation 9.x

Cause

The 'ccs-k3s' service generates two new Subject Alternative Name (SAN) entries (Pod IP and Pod Name) for its internal TLS certificate on every pod startup.
In environments with frequent pod restarts, these SAN entries accumulate in the 'k3s-serving' secret.
Over time, the certificate chain grows beyond the 64 KB (65,536 bytes) limit hardcoded in the Go runtime, which causes TLS handshake failures and prevents the 'ccs-k3s' service from starting.
Because the Aria Automation UI service deployment depends on the successful startup of all Prelude backend services, the UI deployment does not proceed while the 'ccs-k3s' service is unavailable, and the Aria Automation UI remains inaccessible.

Resolution

The issue is being addressed in the following upcoming releases:
- Aria Automation 8.18.1 P6
- VCF Automation 9.1.1.0
- VCF Automation 9.2

No ETA is currently available. Monitor upcoming product announcements for updates on availability.

Workaround:

Step 1: Verify Existing Prelude Pods

Run the following command to list all pods in the prelude namespace: 'kubectl -n prelude get pods | grep ccs-k3s'

NAME	READY	STATUS	RESTARTS	AGE
ccs-k3s-app-aaaaaaaaaa-bbbbb	1/2	CrashLoopBackOff	186 (2m26s ago)	13h
ccs-k3s-app-ccccccccccc-ddddd	2/2	Running	153 (155m ago)	13h
ccs-k3s-app-eeeeeeeeee-fffff	1/2	CrashLoopBackOff	187 (3m24s ago)	13h

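Record the names of all ccs-k3s pods and note which of them are in the Running state. A Running pod is needed in Step 4, and all of the listed pods are deleted in Steps 8 and 9.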
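
Picking out the healthy pods from the listing can be scripted. The following is a minimal sketch, assuming the column layout shown above (NAME, READY, STATUS as the first three columns); the function name running_ccs_pods is illustrative and is not part of the product.

```shell
# Read `kubectl -n prelude get pods` output on stdin and print the names of
# ccs-k3s-app pods whose STATUS is Running and whose READY column is n/n.
# (Hypothetical helper name.)
running_ccs_pods() {
  awk '$1 ~ /^ccs-k3s-app-/ && $3 == "Running" {
         split($2, ready, "/")          # "2/2" -> ready[1]=2, ready[2]=2
         if (ready[1] == ready[2]) print $1
       }'
}

# Example usage on the appliance (not executed here):
#   kubectl -n prelude get pods | running_ccs_pods
```

Any name printed by the helper is a candidate for the 'kubectl exec' command in Step 4.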

Step 2: Obtain Cluster IP of ccs-k3s Service

Run the following command: 'kubectl -n prelude get service ccs-k3s'
Capture the Cluster IP from the output.

Step 3: Validate Existing Certificate

Run the following command using the Cluster IP collected in the previous step: 'echo | openssl s_client -showcerts -connect <Cluster_IP>:6443 2>/dev/null | openssl x509 -text | grep DNS'
```
#Response

s_client: cannot provide both -connect option and target parameter
s_client: Use -help for summary.
Could not read certificate from <stdin>
Unable to load certificate
```
In an affected environment, the command fails with "Unable to load certificate".

Step 4: Access Running ccs-k3s Pod

Use one of the pod names collected in Step 1 and execute: 'kubectl -n prelude exec -it <ccs-k3s-pod-id> -- bash'
```
#Response:
Defaulted container "ccs-k3s-app" out of: ccs-k3s-app, nginx-proxy, ccs-k3s-dependencies (init)
```

Step 5: Verify Existing k3s Serving Secret

Inside the pod, run: 'kubectl -n kube-system get secret k3s-serving'

[ / ]# kubectl -n kube-system get secret k3s-serving
NAME          TYPE                DATA   AGE
k3s-serving   kubernetes.io/tls   2      2y2d

To inspect the secret contents, run: 'kubectl -n kube-system get secret k3s-serving -o yaml'

#Sample Output
listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-ssssss: ccs-k3s-app-xxxxxxxx-ssssss
listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-tttttt: ccs-k3s-app-xxxxxxxx-tttttt
listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-uuuuuu: ccs-k3s-app-xxxxxxxx-uuuuuu
listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-vvvvvv: ccs-k3s-app-xxxxxxxx-vvvvvv
listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-wwwwww: ccs-k3s-app-xxxxxxxx-wwwwww
listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-xxxxxx: ccs-k3s-app-xxxxxxxx-xxxxxx
listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-yyyyyy: ccs-k3s-app-xxxxxxxx-yyyyyy
listener.cattle.io/cn-ccs-k3s-app-xxxxxxxx-zzzzzz: ccs-k3s-app-xxxxxxxx-zzzzzz
listener.cattle.io/cn-kubernetes: kubernetes
listener.cattle.io/cn-kubernetes.default: kubernetes.default
listener.cattle.io/cn-kubernetes.default.svc: kubernetes.default.svc
listener.cattle.io/cn-kubernetes.default.svc.cluster.local: kubernetes.default.svc.cluster.local
listener.cattle.io/fingerprint: SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
creationTimestamp: "2024-05-10T22:43:33Z"
name: k3s-serving
namespace: kube-system
resourceVersion: "123456789"
uid: 12345678-1234-1234-1234-123456789abc
type: kubernetes.io/tls

Note: In an affected environment, the secret contains a very large number of 'listener.cattle.io/cn-' entries, one per accumulated SAN.
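
Instead of reading the YAML by eye, the entries can be counted. The following is a minimal sketch, assuming each SAN appears as one 'listener.cattle.io/cn-' line in the YAML output as in the sample above; the function name count_san_entries is illustrative and is not part of the product.

```shell
# Read the YAML of the k3s-serving secret on stdin and print the number of
# lines that carry a listener.cattle.io/cn-<name> entry.
# (Hypothetical helper name.)
count_san_entries() {
  grep -c 'listener\.cattle\.io/cn-'
}

# Example usage inside the ccs-k3s pod (not executed here):
#   kubectl -n kube-system get secret k3s-serving -o yaml | count_san_entries
```

Recording this number before deleting the secret gives a baseline to compare against when the secret is inspected again after the pods restart.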

Step 6: Delete Existing k3s Serving Secret

Run the following command: 'kubectl -n kube-system delete secret k3s-serving'
This forces regeneration of the serving certificates.

Step 7: Verify Pod Restart

Check the pod status: 'kubectl -n prelude get pods | grep ccs-k3s'
NAME READY STATUS RESTARTS AGE
ccs-k3s-app-11111111111 2/2 Running 187 (6m16s ago) 14h
ccs-k3s-app-22222222222 2/2 Running 153 (159m ago) 14h
ccs-k3s-app-33333333333 1/2 Running 189 (14s ago) 14h
After the secret is deleted, a pod that was crash-looping restarts automatically and shows a recent restart time (14s ago in the example above).

Step 8: Delete Remaining Old Pods

Delete the remaining old pods noted in Step 1: 'kubectl -n prelude delete pod <pod_name>'

# kubectl -n prelude delete pod ccs-k3s-app-11111111111
pod "ccs-k3s-app-11111111111" deleted

# kubectl -n prelude delete pod ccs-k3s-app-22222222222
pod "ccs-k3s-app-22222222222" deleted

Step 9: Delete the Pod That Restarted First in Step 7

'kubectl -n prelude delete pod <pod_name>'

# kubectl -n prelude delete pod ccs-k3s-app-33333333333
pod "ccs-k3s-app-33333333333" deleted

Wait for all pods to return to the Running state.
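
The wait can be turned into a simple polling check. The following is a minimal sketch, assuming the 'kubectl get pods' column layout shown in Step 7; the function name all_ccs_pods_ready and the 10-second polling interval are illustrative choices, not product defaults.

```shell
# Read `kubectl -n prelude get pods` output on stdin. Return 0 only if at
# least one ccs-k3s-app pod is listed and every such pod is Running with all
# containers ready (READY column n/n); return 1 otherwise.
# (Hypothetical helper name.)
all_ccs_pods_ready() {
  awk '$1 ~ /^ccs-k3s-app-/ {
         seen = 1
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) bad = 1
       }
       END { exit !(seen && !bad) }'   # exit status 0 means all pods are ready
}

# Example usage on the appliance (not executed here):
#   until kubectl -n prelude get pods | all_ccs_pods_ready; do sleep 10; done
```

The check treats a pod shown as 1/2 Running as not ready, which matches the intermediate state in the Step 7 sample output.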

Step 10: Verify Regenerated Certificates

Run the following command again: 'kubectl -n kube-system get secret k3s-serving -o yaml'
Verify that the number of certificates is now significantly reduced and appears normal.

Step 11: Power On the Environment from Aria Suite Lifecycle Manager

Log in to VMware Aria Suite Lifecycle Manager (LCM).
Perform the Power On action for the environment.

Step 12: Monitor Prelude Deployment

SSH into the Aria Automation node and monitor the deployment log:
```
tail -f /var/log/deploy.log
```
Observe the Prelude deployment progress and wait for successful completion.

Step 13: Verify Aria Automation UI is accessible
