config-provider fails with the error "no endpoints available for service "cartographer-conventions-webhook-service""

Article ID: 297512

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

The config-provider resource of a workload fails with the error "no endpoints available for service "cartographer-conventions-webhook-service"". The failure is intermittent and sometimes corrects itself.

$ tanzu apps workload get test --namespace lab
......
📦 Supply Chain
name: source-test-to-url

NAME              READY   HEALTHY   UPDATED   RESOURCE
source-provider   True    True      10m       gitrepositories.source.toolkit.fluxcd.io/test
source-tester     True    True      10m       runnables.carto.run/test
image-provider    True    True      9m30s     images.kpack.io/test
config-provider   False   Unknown   9m30s     not found
......
💬 Messages
Workload [TemplateRejectedByAPIServer]: unable to apply object [lab/test] for resource [config-provider] in supply chain [source-test-to-url]: create: Internal error occurred: failed calling webhook "podintents.conventions.carto.run": failed to call webhook: Post "https://cartographer-conventions-webhook-service.cartographer-system.svc:443/mutate-conventions-carto-run-v1alpha1-podintent?timeout=10s": no endpoints available for service "cartographer-conventions-webhook-service"

The Cartographer Conventions controller manager pod shows the status CrashLoopBackOff, or its manager container was last terminated with the reason OOMKilled:

$ kubectl -n cartographer-system get pod cartographer-conventions-controller-manager-xxx -o yaml
......
     containerStatuses:
     - containerID: containerd://abc123
       image: sha256:efg123
       imageID: REPO/tanzu-application-platform-1.7.0/tap-packages@sha256:hkq123
       lastState:
         terminated:
           containerID: containerd://xyz123
           exitCode: 137
           finishedAt: "2024-02-16T04:54:30Z"
           reason: OOMKilled
           startedAt: "2024-02-15T00:41:39Z"
       name: manager
       ready: true
       restartCount: 9
       started: true
       state:
         running:
           startedAt: "2024-02-16T04:54:31Z"
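
The same fields can be read without scanning the full YAML. The following is a minimal sketch, assuming kubectl access to the cluster; the function names last_terminations and signal_from_exit_code are hypothetical helpers introduced here for illustration, and the namespace is taken from the output above.

```shell
#!/bin/sh
# Namespace taken from the kubectl command shown above.
NAMESPACE="cartographer-system"

# List every pod in the namespace with its container restart counts and the
# reason recorded for the last container termination (for example OOMKilled).
# kubectl prints <none> where a container has never been terminated.
last_terminations() {
  kubectl -n "$NAMESPACE" get pods \
    -o custom-columns='POD:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,LAST_REASON:.status.containerStatuses[*].lastState.terminated.reason'
}

# Exit codes above 128 mean the process was stopped by a signal:
# exit code minus 128 gives the signal number (137 - 128 = 9, SIGKILL,
# the signal sent when a container exceeds its memory limit).
signal_from_exit_code() {
  echo $(( $1 - 128 ))
}
```

Calling last_terminations prints one row per pod. Calling signal_from_exit_code 137 prints 9, which is consistent with the OOMKilled reason in the output above.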

Resolution

This is a known issue in Tanzu Application Platform (TAP) 1.7:

Cause: This error usually occurs when a workload image built by the supply chain contains a large SBOM. The default memory limit set during installation might be too small to process the pod conventions for such an image, which causes the controller pod to be OOMKilled and restart. While the pod is down, the webhook service has no endpoints.
Workaround: In TAP 1.7, the default memory limit for the convention server is 256Mi. To raise it, see the topic "Increase the memory limit for convention server".
Permanent fix: TAP 1.8 increases the default resource limits for Cartographer Conventions to mitigate this issue. The default memory limit for the convention server in TAP 1.8 is 512Mi.
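
To confirm which memory limit is currently in effect, before or after applying the workaround, the limit can be read from the deployment. This is a sketch under stated assumptions: the deployment name is inferred from the pod name prefix shown earlier, the container name manager comes from the pod status output, and manager_memory_limit is a hypothetical helper name.

```shell
#!/bin/sh
NAMESPACE="cartographer-system"
# Assumption: the deployment name matches the pod name prefix seen above.
DEPLOYMENT="cartographer-conventions-controller-manager"

# Print the memory limit configured on the "manager" container.
# Prints nothing if no memory limit is set on that container.
manager_memory_limit() {
  kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="manager")].resources.limits.memory}'
}
```

Run manager_memory_limit and compare the value with the defaults stated above. If the value still matches the TAP 1.7 default after the workaround is applied, the change has probably not been reconciled onto the deployment yet.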
