This article describes an issue where the Kubernetes service-ip-allocator fails to start due to leaked NodePorts.
The following errors are observed during Kubernetes API server startup:
poststarthook/start-service-ip-repair-controllers failed: not finished
healthz.go:261] informer-sync,poststarthook/start-service-ip-repair-controllers,poststarthook/rbac/bootstrap-roles check failed: readyz
As a result, the service-ip-allocator controller is unable to initialize properly.
TKGi 1.x
This issue is linked to Kubernetes GitHub Issue #130377.
The problem arises when a ResourceQuota is configured with a hard limit of 0 for services.nodeports. In this scenario, services such as those created by cert-manager for HTTP01 challenges attempt to create NodePort Services without specifying a NodePort or ClusterIP. The kube-apiserver dynamically allocates these values and records them in the corresponding allocation bitmaps in etcd. Although the ResourceQuota admission controller subsequently rejects the Service creation with a 403 Forbidden error because of the NodePort limit, the already-allocated NodePort and ClusterIP values are never released. Over time, repeated retries by cert-manager leak allocations until the service-ip-allocator fails to start as described above. In the impacted environment, cert-manager was configured with a default Issuer using the NodePort service type, which triggered this behavior.
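For illustration, a quota of the following shape reproduces the triggering condition; the namespace and quota name here are hypothetical:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: deny-nodeports       # hypothetical name
  namespace: example-ns      # hypothetical namespace
spec:
  hard:
    services.nodeports: "0"  # a hard limit of 0 rejects every NodePort Service at admission
EOF

Any Service of type NodePort created in this namespace is denied with a 403 Forbidden, but only after the API server has already written the port and ClusterIP allocations to etcd.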
This issue is resolved in Kubernetes 1.33.
As a workaround, update the cert-manager Issuer and ClusterIssuer configurations to use the ClusterIP service type instead of NodePort for the HTTP01 solver.
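A minimal sketch of the change, assuming an ACME ClusterIssuer with an HTTP01 ingress solver; the issuer name, e-mail address, and Secret name are hypothetical:

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: example-letsencrypt          # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com         # hypothetical contact address
    privateKeySecretRef:
      name: example-letsencrypt-key  # hypothetical Secret name
    solvers:
    - http01:
        ingress:
          serviceType: ClusterIP     # ClusterIP instead of NodePort
EOF

With serviceType set to ClusterIP, the solver Services never request a NodePort, so the retries no longer touch the NodePort allocation bitmap.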
To detect a recurrence, monitor the size and update frequency of the allocation bitmaps:
/var/vcap/jobs/etcd/bin/etcdctl get --prefix /registry/ranges
# In particular, watch the value of
/registry/ranges/servicenodeports
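A minimal monitoring sketch, assuming the etcdctl wrapper at this path already handles endpoints and TLS and passes flags through; the 5-minute interval is arbitrary:

# Log the revision counters of the NodePort range allocation key. A
# Version that climbs steadily without matching Service churn suggests
# leaked allocations. The stored value is a protobuf-encoded
# RangeAllocation, so the counters are easier to track than the raw bitmap.
while true; do
  date
  /var/vcap/jobs/etcd/bin/etcdctl get /registry/ranges/servicenodeports \
    --write-out=fields | grep -E '"(ModRevision|Version)"'
  sleep 300
done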