Prometheus pod in CrashLoopBackOff state in a Tanzu Mission Control Self-Managed environment

Article ID: 384136

Products

VMware Tanzu Mission Control Self-Managed

Issue/Introduction


The prometheus-server-tmc-local-monitoring-tmc-local-0 pod is in the CrashLoopBackOff state.

Environment

Tanzu Mission Control Self-Managed 1.3.0

Cause

  • Check the logs of the prometheus container in the prometheus-server-tmc-local-monitoring-tmc-local-0 pod. Prometheus starts and then terminates with a SIGBUS fault:
# kubectl logs -f prometheus-server-tmc-local-monitoring-tmc-local-0 -c prometheus -n tmc-local

ts=2024-10-23T13:20:55.551Z caller=main.go:617 level=info msg="Starting Prometheus Server" mode=server version="(version=2.51.2, branch=HEAD, revision=b4c0ab52c3e9b940ab803581ddae9b3d9a452337)"
ts=2024-10-23T13:20:55.551Z caller=main.go:622 level=info build_context="(go=go1.21.9, platform=linux/amd64, user=root@6d5384d4ddf8, date=20240410-15:17:17, tags=netgo,builtinassets,stringlabels)"
ts=2024-10-23T13:20:55.551Z caller=main.go:623 level=info host_details="(Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 prometheus-server-tmc-local-monitoring-tmc-local-0 (none))"
ts=2024-10-23T13:20:55.551Z caller=main.go:624 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2024-10-23T13:20:55.551Z caller=main.go:625 level=info vm_limits="(soft=unlimited, hard=unlimited)"
unexpected fault address 0x7f435429b000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f435429b000 pc=0x472482]
  • The SIGBUS error occurs because the volume used by this pod has filled to capacity. Prometheus memory-maps its on-disk data, and accessing a mapped page that a full file system cannot back typically raises SIGBUS.
  • To confirm, identify the node where the prometheus-server-tmc-local-monitoring-tmc-local-0 pod is scheduled, log in to that node, and run df -h to check the size and usage of the volume.
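
The check above can be sketched as a pair of shell helpers. This is a minimal sketch, not part of the documented procedure: the function names pod_node and full_mounts are hypothetical, the 90% default threshold is an arbitrary assumption, and the pod and namespace names are taken from this article.

```shell
# Names taken from this article; adjust them to your environment.
POD="prometheus-server-tmc-local-monitoring-tmc-local-0"
NS="tmc-local"

# Print the name of the node the pod is scheduled on.
pod_node() {
  kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeName}'
}

# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose usage is at or above the threshold (default 90%).
# Assumes mount points contain no spaces.
full_mounts() {
  awk -v t="${1:-90}" 'NR > 1 {
    use = $5; sub(/%/, "", use)      # strip the trailing percent sign
    if (use + 0 >= t) print $6, $5   # mount point, then usage
  }'
}
```

Run pod_node from a workstation with cluster access, then run df -P | full_mounts 90 on that node. Pod volumes are typically mounted under /var/lib/kubelet/pods/, so an entry under that path at 100% would be consistent with the cause described above.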

Resolution

To resolve this issue, expand the volume online, that is, while it remains attached.

Follow the procedure below to expand the volume.

  • Verify that the storage class has ALLOWVOLUMEEXPANSION set to true:
# kubectl get sc default
NAME                PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
default (default)   csi.vsphere.vmware.com   Delete          Immediate           true                   3h7m
  • Find the persistent volume claim to resize using the following command:
# kubectl get pv,pvc,pod -n tmc-local
  • Patch the PVC to increase its size, using the PVC name returned by the previous command (Kubernetes object names are lowercase). For example, to increase the size to 10Gi:
# kubectl patch pvc data-prometheus-server-tmc-local-monitoring-tmc-local-0 -n tmc-local -p '{"spec": {"resources": {"requests": {"storage": "10Gi"}}}}'
  • Once the patch has been applied and the volume has been resized, the pod returns to the Running state.
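
The resize and a follow-up check can be wrapped in a few shell helpers. This is a sketch under stated assumptions rather than the documented procedure: the function names pvc_patch, expand_pvc, and wait_for_pod are hypothetical, the 300s default timeout is arbitrary, and the PVC name is assumed to be the lowercase form of the name used in this article, so confirm it against the output of kubectl get pvc -n tmc-local first.

```shell
# Names taken from this article; the PVC name is an assumption to verify.
NS="tmc-local"
POD="prometheus-server-tmc-local-monitoring-tmc-local-0"
PVC="data-prometheus-server-tmc-local-monitoring-tmc-local-0"

# Build the patch body for a requested size such as 10Gi.
pvc_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}' "$1"
}

# Hypothetical wrapper: patch the PVC, then print the requested size
# next to the capacity the cluster currently reports.
expand_pvc() {
  kubectl patch pvc "$PVC" -n "$NS" -p "$(pvc_patch "$1")" || return 1
  kubectl get pvc "$PVC" -n "$NS" \
    -o jsonpath='{.spec.resources.requests.storage}{" requested, "}{.status.capacity.storage}{" provisioned"}{"\n"}'
}

# Hypothetical wrapper: block until the pod reports Ready or the timeout expires.
wait_for_pod() {
  kubectl wait --for=condition=Ready "pod/$POD" -n "$NS" --timeout="${1:-300s}"
}
```

For example, expand_pvc 10Gi followed by wait_for_pod. The provisioned capacity may lag behind the requested size until the file system resize completes, so the two values can differ for a short time after the patch.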