Aria Automation Pods Fail with ErrImageNeverPull or Image Pull Authentication Errors

Article ID: 403660


Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Symptoms

  • After an appliance restart or a change in the environment, some Aria Automation pods fail to initialize. Running the command kubectl get pods -n prelude shows key services in a persistent ContainerCreating, Init:0/1, or ErrImageNeverPull status (sample output is shown after this list). Examples of affected pods include:
    • ccs-k3s-post-install-job-...
    • idem-service-worker-...
  • When inspecting a failing pod with kubectl describe pod <pod-name> -n prelude, the Events section at the bottom shows a misleading "pull access denied" or authentication-related error, even though Aria Automation uses local images. The error may look similar to this:
    Failed to pull image "image-name:version": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/image-name:version": failed to resolve reference "docker.io/library/image-name:version": pull access denied, repository does not exist or may require 'docker login'
  • Prior to the issue, you may have observed high resource usage on the appliances, such as memory or disk utilization exceeding 90%.
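For reference, the output of kubectl get pods -n prelude in an affected environment may look similar to the following. The pod name suffixes, ages, and exact set of affected services shown here are illustrative only and will differ in your environment:

  NAME                                  READY   STATUS              RESTARTS   AGE
  ccs-k3s-post-install-job-abc12        0/1     ErrImageNeverPull   0          12m
  idem-service-worker-6f7d9c5b8-xk2lp   0/1     Init:0/1            0          12m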

Environment

VMware Aria Automation 8.x

Cause

The primary cause of this issue is an automated, self-preservation mechanism within the appliance's Kubernetes environment.

When an appliance node experiences high resource consumption (typically over 80-90% disk or memory utilization), Kubernetes will automatically begin a process called "image garbage collection." This process deletes local container images that are not currently in use to free up space and prevent a critical system failure.
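If you want to confirm that a node was under resource pressure, the following checks can help. These are standard Linux and Kubernetes commands rather than part of the fix, older events may already have aged out of the event log, and you should substitute your own node name where shown:

  df -h /
  free -m
  kubectl describe node <node-name> | grep -A 6 "Conditions:"
  kubectl get events -A | grep -iE "diskpressure|memorypressure|evict|imagegc"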

While this is normal behavior, it means that if services are restarted later (either manually or via an automated process), the required images are no longer present on the node. Because the pods are configured to run from locally stored images rather than pull them from an external registry, Kubernetes cannot simply download them again and instead reports the ErrImageNeverPull state.

A less common cause is a storage or network outage that can lead to a corrupted local image cache, producing the same symptoms.

Resolution

The solution is to restore the missing container images from the appliance's local archive on each affected node. This will allow Kubernetes to start the pods successfully.

  1. Identify the Affected Nodes: First, SSH into any node in your Aria Automation cluster. You can find where failing pods are located using one of these methods:
    • To see all pods and nodes at once, use the -o wide flag:
      kubectl get pods -n prelude -o wide
    • To get details for one specific pod, use kubectl describe. Look for the "Node:" line in the output to see which appliance it's assigned to.
      kubectl describe pod <pod-name> -n prelude
    Note the names of the nodes (e.g., vra-node-01, vra-node-02) where pods are in an ErrImageNeverPull or other error state.
  2. Restore Images on Each Affected Node: SSH directly into each node you identified in the previous step. Execute the following script to restore the container images. This script will unpack and load all necessary images into the local cache.
    /opt/scripts/restore_docker_images.sh
    You must run this command on every node that has failing pods.
  3. Verify the Fix: After running the script on a node, Kubernetes will automatically detect the presence of the new images and attempt to restart the failed pods. You can monitor the progress from any node by running:
    watch kubectl get pods -n prelude
    The pods should transition from ErrImageNeverPull to ContainerCreating, and finally to a healthy Running state within a few minutes. Once all pods are running, the cluster will return to a healthy state.
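Optionally, after running the restore script, you can confirm that the restored images are present in the node's local cache. The restore script name references Docker, so the Docker CLI is shown first; if your appliance build uses containerd as the runtime, crictl provides an equivalent listing:

  docker images | head -n 20
  crictl images | head -n 20

A quick way to spot anything that is still unhealthy is to filter out pods that are already running or completed:

  kubectl get pods -n prelude | grep -vE "Running|Completed"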