Pods failing to restart and going to Completed state in Kubernetes v1.27.5

Article ID: 384708

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

After a pod is terminated for exceeding its ephemeral storage limit, it is not restarted. It goes to the Completed state and stays there.

Steps to reproduce: 

On a cluster running Kubernetes v1.27.5, create a simple StatefulSet with an ephemeral storage limit of 1Gi:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-statefulset
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: projects5-proxy.packages.broadcom.com/antrea/nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            ephemeral-storage: 200Mi
          limits:
            ephemeral-storage: 1Gi

Exec into the nginx pod and fill the space:

kubectl exec -it nginx-statefulset-0 -- bash
root@nginx-statefulset-0:/# dd if=/dev/urandom of=bigfile.dat bs=1M count=1100

This results in the pod being evicted for exceeding the ephemeral storage limit:

1100+0 records in
1100+0 records out
1153433600 bytes (1.2 GB, 1.1 GiB) copied, 5.88756 s, 196 MB/s
root@nginx-statefulset-0:/# command terminated with exit code 137

ubuntu@jumpbox:~$ kubectl get po
NAME                                READY   STATUS      RESTARTS   AGE
nginx-statefulset-0                 0/1     Completed   0          64m
nginx-statefulset-1                 1/1     Running     0          77m
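
The byte count in the dd output above can be checked against the 1Gi limit from the manifest. The following is a small arithmetic sketch only; the variable names are illustrative and not part of the original reproduction steps.

```shell
# 1100 blocks of 1 MiB, as passed to dd (bs=1M count=1100)
written=$((1100 * 1024 * 1024))
# ephemeral-storage limit of 1Gi from the StatefulSet manifest
limit=$((1024 * 1024 * 1024))
# number of bytes written beyond the limit
over=$((written - limit))
echo "written=$written limit=$limit over=$over"
```

The written value matches the 1153433600 bytes reported by dd, which is 76 MiB more than the 1Gi limit.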

Once the pod is terminated, it is not restarted. Instead it goes into the Completed state and does not recover until it is manually deleted.

After manual deletion, the pod is recreated successfully and works fine.
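
The manual deletion can be scripted. The following is a minimal sketch, assuming the table layout of the kubectl get po output shown above; completed_pods is a hypothetical helper name, not a kubectl feature, and the sample text is copied from this article.

```shell
# Hypothetical helper: reads the table printed by "kubectl get pods" on
# stdin and prints the name of every pod whose STATUS column is "Completed".
completed_pods() {
  awk '$3 == "Completed" { print $1 }'
}

# Sample input taken from the output shown in this article.
sample='NAME                                READY   STATUS      RESTARTS   AGE
nginx-statefulset-0                 0/1     Completed   0          64m
nginx-statefulset-1                 1/1     Running     0          77m'

stuck=$(printf '%s\n' "$sample" | completed_pods)
echo "$stuck"

# Against a live cluster (assumes kubectl is configured for it), the same
# helper could feed a delete loop. The label app=nginx comes from the manifest:
#   kubectl get pods -n default -l app=nginx | completed_pods |
#     while read -r p; do kubectl delete pod "$p" -n default; done
```

Restricting the query with a label selector is deliberate: pods created by Jobs normally end in the Completed state, and those should not be deleted by this loop.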

Environment

TKGm 2.4.x

Kubernetes v1.27.5

TKR v1.27.5---vmware.1-tkg.1

Cause

This is a regression in this version of Kubernetes. The issue is fixed in v1.27.9 and above.
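
To check a version string against the fixed release, a comparison along these lines may help. This is a sketch under assumptions: k8s_fix_status is a hypothetical helper name, it only evaluates the 1.27 release line named in this article, and it reports "unknown" for any other line because this article does not state which other releases carry the fix.

```shell
# Hypothetical helper: classifies a Kubernetes version string relative to
# v1.27.9, the release this article names as containing the fix.
# Note: uses plain (global) shell variables to stay POSIX-compatible.
k8s_fix_status() {
  v="${1#v}"        # strip the leading "v"
  v="${v%%[-+]*}"   # strip build suffixes such as "---vmware.1-tkg.1" or "+vmware.1"
  IFS=. read -r major minor patch <<EOF
$v
EOF
  if [ "$major" != "1" ] || [ "$minor" != "27" ]; then
    echo "unknown"          # other release lines are not covered by this article
  elif [ "${patch:-0}" -ge 9 ]; then
    echo "has-fix"
  else
    echo "lacks-fix"
  fi
}

k8s_fix_status "v1.27.5---vmware.1-tkg.1"
k8s_fix_status "v1.27.9"

# Against a live cluster (assumes kubectl is configured for it), the kubelet
# version of the first node could be passed in:
#   k8s_fix_status "$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}')"
```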

Resolution

The recommended action is to upgrade to the latest version of TKG, which moves the cluster off Kubernetes v1.27.5.

Additional Information

https://community.replicated.com/t/statefulset-pods-stuck-in-completed-after-node-restart/1410 

https://github.com/kubernetes/kubernetes/issues/124065#issuecomment-2028274184