config-provider fails with the error "no endpoints available for service "cartographer-conventions-webhook-service""

Article ID: 297512

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

The config-provider resource of a workload fails with the error "no endpoints available for service "cartographer-conventions-webhook-service"". The issue can be intermittent and sometimes corrects itself. The failure is visible in the workload status:
$ tanzu apps workload get test --namespace lab
......
📦 Supply Chain
name: source-test-to-url

NAME              READY   HEALTHY   UPDATED   RESOURCE
source-provider   True    True      10m       gitrepositories.source.toolkit.fluxcd.io/test
source-tester     True    True      10m       runnables.carto.run/test
image-provider    True    True      9m30s     images.kpack.io/test
config-provider   False   Unknown   9m30s     not found
......
💬 Messages
Workload [TemplateRejectedByAPIServer]: unable to apply object [lab/test] for resource [config-provider] in supply chain [source-test-to-url]: create: Internal error occurred: failed calling webhook "podintents.conventions.carto.run": failed to call webhook: Post "https://cartographer-conventions-webhook-service.cartographer-system.svc:443/mutate-conventions-carto-run-v1alpha1-podintent?timeout=10s": no endpoints available for service "cartographer-conventions-webhook-service"
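
To confirm that the webhook service has no ready backends at the moment of failure, a check along the following lines may help. This is a minimal sketch, not part of the original article: the function names `webhook_endpoint_ips` and `webhook_endpoint_status` are hypothetical, and it assumes `kubectl` is configured for the affected cluster.

```shell
# Hypothetical helper: print the pod IPs currently backing the conventions
# webhook service. An empty result matches "no endpoints available".
webhook_endpoint_ips() {
  kubectl -n cartographer-system get endpoints cartographer-conventions-webhook-service \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Hypothetical helper: summarize the result in one line.
webhook_endpoint_status() {
  ips="$(webhook_endpoint_ips)"
  if [ -z "$ips" ]; then
    echo "no ready endpoints"
  else
    echo "ready endpoints: $ips"
  fi
}
```

Calling `webhook_endpoint_status` while the workload is failing and again after it recovers should show whether the endpoint list is empty only while the controller pod is restarting.
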
Inspecting the Cartographer Conventions controller manager pod shows that it is in CrashLoopBackOff, or that its manager container was last terminated with the reason OOMKilled:
$ kubectl -n cartographer-system get pod cartographer-conventions-controller-manager-xxx -o yaml
......
     containerStatuses:
     - containerID: containerd://abc123
       image: sha256:efg123
       imageID: REPO/tanzu-application-platform-1.7.0/tap-packages@sha256:hkq123
       lastState:
         terminated:
           containerID: containerd://xyz123
           exitCode: 137
           finishedAt: "2024-02-16T04:54:30Z"
           reason: OOMKilled
           startedAt: "2024-02-15T00:41:39Z"
       name: manager
       ready: true
       restartCount: 9
       started: true
       state:
         running:
           startedAt: "2024-02-16T04:54:31Z"
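
Instead of reading the full YAML, the last termination reason of the manager container can be extracted directly. The following is a sketch under stated assumptions: the helper names are hypothetical, the pod name is passed in by the caller, and the container name `manager` is taken from the output above.

```shell
# Hypothetical helper: print the last termination reason of the "manager"
# container for the given pod (empty if the container never terminated).
manager_last_termination() {
  pod="$1"
  kubectl -n cartographer-system get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="manager")].lastState.terminated.reason}'
}

# Hypothetical helper: succeed only if the last termination was an OOM kill.
was_oom_killed() {
  [ "$(manager_last_termination "$1")" = "OOMKilled" ]
}
```

For example, `was_oom_killed cartographer-conventions-controller-manager-xxx && echo "OOMKilled"` prints a line only when the pod matches the symptom described in this article.
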


Resolution

This is a known issue in TAP 1.7:
  • Cause: The error usually occurs when a workload image built by the supply chain contains a large SBOM. The default resource limits set during installation can be too low to process the pod conventions, so the convention controller pod runs out of memory and crashes, leaving the webhook service without endpoints.
  • Workaround: In TAP 1.7, the default memory limit for the convention server is 256Mi. To increase it, see Increase the memory limit for convention server.
  • Permanent fix: TAP 1.8 increases the resource limits for Cartographer Conventions to mitigate this issue. The default memory limit for the convention server in TAP 1.8 is 512Mi.
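
Before and after applying the workaround, it can be useful to check which memory limit is actually in effect. The sketch below is illustrative only: the helper names are hypothetical, and the deployment name `cartographer-conventions-controller-manager` is an assumption inferred from the pod name shown earlier in this article.

```shell
# Hypothetical helper: print the memory limit configured on the "manager"
# container. The deployment name is assumed from the pod name prefix.
convention_memory_limit() {
  kubectl -n cartographer-system get deployment cartographer-conventions-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="manager")].resources.limits.memory}'
}

# Hypothetical helper: flag the TAP 1.7 default so it is easy to spot.
report_memory_limit() {
  limit="$(convention_memory_limit)"
  case "$limit" in
    256Mi) echo "memory limit is $limit (TAP 1.7 default)" ;;
    "")    echo "no memory limit found on container manager" ;;
    *)     echo "memory limit is $limit" ;;
  esac
}
```

If `report_memory_limit` still reports 256Mi after the workaround has been applied, the change has most likely not been reconciled into the deployment yet.
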