The prometheus-server pod is stuck in CrashLoopBackOff status.
# kubectl -n tanzu-system-monitoring get pods
NAME                    READY   STATUS             RESTARTS         AGE
prometheus-server-###   1/2     CrashLoopBackOff   30 (3m32s ago)   132m
Error message:
# kubectl -n tanzu-system-monitoring logs prometheus-server-### -c prometheus-server | grep error
ts=2025-02-27T05:31:26.411Z caller=main.go:1159 level=error err="opening storage failed: get segment range: segments are not sequential"
Tanzu Kubernetes Grid
The prometheus-server pod fails to start because of inconsistencies in the write-ahead log (WAL) under the /data/wal/
directory, which is stored on the persistent volume.
https://github.com/prometheus/prometheus/issues/5342
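On startup, Prometheus requires WAL segment file names to be consecutive numbers; a gap in the sequence produces the "segments are not sequential" error above. A minimal local sketch of that invariant (dummy file names in a temporary directory, not real WAL data):

```shell
# Create dummy WAL-style segment files (8-digit, zero-padded names);
# 00000003 is deliberately missing to simulate the broken state.
dir=$(mktemp -d)
touch "$dir/00000001" "$dir/00000002" "$dir/00000004"

prev=
msg=
for f in "$dir"/*; do
  n=$((10#${f##*/}))        # file name -> integer, leading zeros stripped
  if [ -n "$prev" ] && [ "$n" -ne $((prev + 1)) ]; then
    msg="segments are not sequential: $prev -> $n"
  fi
  prev=$n
done
echo "$msg"                 # -> segments are not sequential: 2 -> 4
rm -rf "$dir"
```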
Caution - this procedure discards recent data (metrics collected over the last few hours up to a few days, held in the WAL and the active head chunks). Data already compacted into persistent blocks is retained.
Mount the persistent volume from a temporary recovery pod and remove the inconsistent data manually.
1. Stop the prometheus-server pod
namespace=tanzu-system-monitoring
kubectl -n $namespace scale deployment/prometheus-server --replicas=0
2. Create a temporary pod that mounts the prometheus-server persistent volume
pvc=prometheus-server
image=gcr.io/google-samples/node-hello:1.0
cat > recovery-pod-for-prometheus.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: recovery-prometheus
  namespace: ${namespace}
spec:
  volumes:
    - name: prometheus-server-pv
      persistentVolumeClaim:
        claimName: ${pvc}
  containers:
    - name: hello-world
      image: ${image}
      command: [ "sleep", "365d" ]
      volumeMounts:
        - name: prometheus-server-pv
          mountPath: "/data"
EOF
kubectl apply -f recovery-pod-for-prometheus.yaml
3. Delete the contents of /data/wal/ and /data/chunks_head/
kubectl -n $namespace exec -it recovery-prometheus -- bash
rm -r /data/wal/*
rm -r /data/chunks_head/*
exit
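To see why this deletion is safe for older data, the effect of step 3 can be reproduced in a local sandbox (a temporary directory with dummy files, not the real PV; the block directory name 01HXEXAMPLEBLOCK is made up for illustration). Emptying /data/wal/ and /data/chunks_head/ leaves the compacted block directories intact:

```shell
# Local sandbox mirroring the layout of /data on the persistent volume.
data=$(mktemp -d)
mkdir -p "$data/wal" "$data/chunks_head" "$data/01HXEXAMPLEBLOCK"
touch "$data/wal/00000001" "$data/chunks_head/000001" "$data/01HXEXAMPLEBLOCK/meta.json"

# Same commands as step 3: remove the contents, keep the directories
# themselves so Prometheus can recreate the WAL on startup.
rm -r "$data/wal/"* "$data/chunks_head/"*

wal_left=$(ls -A "$data/wal" | wc -l)                  # WAL emptied
block_left=$(ls -A "$data/01HXEXAMPLEBLOCK" | wc -l)   # compacted block untouched
echo "wal entries: $wal_left, block entries: $block_left"
rm -rf "$data"
```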
4. Resume the prometheus-server pod
# Delete the temporary pod
kubectl delete -f recovery-pod-for-prometheus.yaml
# prometheus-server: 0 --> 1
kubectl -n $namespace scale deployment/prometheus-server --replicas=1
# Check that the prometheus-server pod is "Running" (READY 2/2)
kubectl -n $namespace get pods