
kube-apiserver not starting during cluster upgrade


Article ID: 403462


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Several different ways of restarting the master node were attempted, but kube-apiserver starts and then fails immediately.

etcd appears to be healthy, but the master cannot be brought back to a running state.

In the kube-apiserver logs, messages like the following can be seen:

I0707 09:08:01.435396       6 healthz.go:261] poststarthook/rbac/bootstrap-roles check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
I0707 09:08:01.443460       6 healthz.go:261] poststarthook/rbac/bootstrap-roles check failed: healthz
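On a TKGI control plane node, the kube-apiserver process is managed by monit and its logs are written under the BOSH log directory. The following is a minimal way to check both, assuming the standard TKGI/BOSH layout (the log path is an assumption and may differ in your deployment); run as root on the master VM:

# Show the state of the monit-managed jobs, including kube-apiserver
monit summary

# Tail the kube-apiserver job logs
tail -f /var/vcap/sys/log/kube-apiserver/*.log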

 

Environment

TKGI 1.2x 

Cause

Investigation of the kube-apiserver logs shows the following snippet:

W0707 15:01:42.235278       7 dispatcher.go:217] Failed calling webhook, failing closed validate.kyverno.svc-fail: failed calling webhook "validate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.ccp-kyverno.svc:443/validate/fail?timeout=10s": context deadline exceeded
I0707 15:01:42.235535       7 trace.go:236] Trace[708539055]: "Create" accept:application/vnd.kubernetes.protobuf, */*,audit-id:0xxxxxxxxxxxxxxxxd080,client:127.0.0.1,api-group:rbac.authorization.k8s.io,api-version:v1,name:,subresource:,namespace:,protocol:HTTP/2.0,resource:clusterrolebindings,scope:resource,url:/apis/rbac.authorization.k8s.io/v1/clusterrolebindings,user-agent:kube-apiserver/v1.29.6+vmware.1 (linux/amd64) kubernetes/73fc294,verb:POST (07-Jul-2025 15:01:32.233) (total time: 10001ms):
Trace[708539055]: ["Call validating webhook" configuration:kyverno-resource-validating-webhook-cfg,webhook:validate.kyverno.svc-fail,resource:rbac.authorization.k8s.io/v1, Resource=clusterrolebindings,subresource:,operation:CREATE,UID:f473axxxxxxxxxxxcdcc 10000ms (15:01:32.234)]

Note: Kyverno is a third-party policy tool.

It appears that the related webhook is not in a running state, and because its failure policy is set to Fail, it prevents the health check from completing.
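The webhook configuration and its failure policy can be inspected to confirm this. The configuration name kyverno-resource-validating-webhook-cfg below is taken from the log line above and may differ in other clusters:

# List the validating webhook configurations registered in the cluster
kubectl get validatingwebhookconfigurations

# Show the failure policy of each webhook in the Kyverno configuration
kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'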

Resolution

To bypass the problem:

1. Export the validating webhook configuration to a YAML file as a backup (see the example commands below).

2. Delete the validating webhook configuration.

3. After starting the kube-apiserver service, confirm that the health check passes with kubectl get --raw /healthz?verbose (sample output is shown further below).
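A minimal sketch of steps 1 and 2, assuming the webhook configuration name from the log snippet in the Cause section; substitute the name reported in your own logs:

# Step 1: back up the webhook configuration to a YAML file
kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg -o yaml > kyverno-resource-validating-webhook-cfg.yaml

# Step 2: delete the webhook configuration so it no longer blocks API requests
kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg

# If Kyverno does not recreate the configuration once it is healthy again, it can be restored from the backup
kubectl apply -f kyverno-resource-validating-webhook-cfg.yaml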

kubectl get --raw /healthz?verbose
[+]ping ok
[+]log ok
[+]etcd ok
...
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
....
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok

 

 

Additional Information

There could potentially be other issues affecting the kube-apiserver health check in such situations; reach out to support for diagnosis.
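When engaging support, capturing the verbose output of the other standard kube-apiserver health endpoints can help narrow down which check is failing:

# Readiness checks, including the poststarthook entries seen in the logs above
kubectl get --raw '/readyz?verbose'

# Liveness checks
kubectl get --raw '/livez?verbose'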