Disable softnet scrape in package prometheus

Article ID: 389182


Products

Tanzu Kubernetes Runtime, VMware Tanzu Kubernetes Grid, VMware Tanzu Kubernetes Grid 1.x, VMware Tanzu Kubernetes Grid Plus, VMware Tanzu Kubernetes Grid Plus 1.x

Issue/Introduction

A kernel deadlock can occur while reading /proc/net/softnet_stat. More details: https://github.com/vmware/photon/commit/ed28c67a054c2e70d4be2f2b6ba5870da712bb20

Reading /proc/net/softnet_stat from userspace while network packets are being received and processed on the same CPU can deadlock the entire system.

Photon OS 5 is affected if the kernel version is lower than 6.1.128-2.ph5. Linux RPM version 6.1.128-2.ph5 or higher contains the fix.
If your appliance constantly monitors network activity, either by reading /proc/net/softnet_stat directly or through network tools, the system can be affected and can crash.
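
To decide whether a node falls in the affected range, its kernel version can be compared against 6.1.128-2.ph5. The following is a minimal sketch, not an official check: the function name kernel_is_affected is hypothetical, it assumes GNU sort with the -V (version sort) option is available, and it assumes the version string has the form version-release.ph5 with no additional kernel flavor suffix.

```shell
#!/bin/sh
# First kernel RPM version that contains the fix, per this article.
FIXED_VERSION="6.1.128-2.ph5"

# kernel_is_affected VERSION
# Returns 0 (true) if VERSION sorts strictly below FIXED_VERSION,
# returns 1 (false) if it is equal or higher.
# Assumption: GNU "sort -V" is available for version-aware ordering.
kernel_is_affected() {
    current="$1"
    # An exact match is already fixed.
    [ "$current" = "$FIXED_VERSION" ] && return 1
    # Sort both versions and take the lower one.
    lowest=$(printf '%s\n%s\n' "$current" "$FIXED_VERSION" | sort -V | head -n 1)
    # Affected only if the current version is the lower of the two.
    [ "$lowest" = "$current" ]
}
```

On a node, the function could be called as kernel_is_affected "$(uname -r)" && echo "affected". If uname -r reports extra suffixes on your nodes, strip them before comparing.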

 

Environment

TKGm 2.5.x clusters deployed with Photon OS 5 where the kernel version is lower than 6.1.128-2.ph5

Cause

The prometheus package, one of the packages that can be deployed with Tanzu, contains a node_exporter DaemonSet that scrapes the problematic path by default.

This inevitably leads to a crash of a random worker or master node, typically one random node per week.

 

Resolution

Upgrade to 2.5.3 once available, where Photon 5 is shipped with the fixed kernel version.

To prevent the node exporter in the prometheus package from scraping the softnet stats, the option "--no-collector.softnet" can be added. Below are the steps to apply this configuration:

  1. Generate file overlay.yaml with the fix:
    #@ load("@ytt:overlay", "overlay")
    #@overlay/match by=overlay.subset({"kind":"DaemonSet", "metadata":{"name":"prometheus-node-exporter"}}),expects=1
    ---
    spec:
      template:
        spec:
          containers:
          #@overlay/match by="name"
          - name: prometheus-node-exporter
            args:
            #@overlay/append
            - --no-collector.softnet
  2. Create a Secret in the namespace where the packageinstall is deployed. In this example the namespace is tanzu-system-monitoring:
    kubectl create secret generic node-exporter-fix -n tanzu-system-monitoring -o yaml --dry-run=client --from-file=overlay.yaml | kubectl apply -f -
  3. Annotate the packageinstall with the secret created:
    kubectl annotate pkgi -n tanzu-system-monitoring prometheus ext.packaging.carvel.dev/ytt-paths-from-secret-name.0=node-exporter-fix
  4. Optionally, to force reconciliation, pause and unpause the application:
    kubectl patch app -n tanzu-system-monitoring prometheus --type merge -p '{"spec":{"paused":true}}'
    kubectl patch app -n tanzu-system-monitoring prometheus --type merge -p '{"spec":{"paused":false}}'
     

Verification:

kubectl get pkgi -n tanzu-system-monitoring   prometheus -oyaml
apiVersion: packaging.carvel.dev/v1alpha1
kind: PackageInstall
metadata:
  annotations:
    ext.packaging.carvel.dev/ytt-paths-from-secret-name.0: node-exporter-fix
kubectl get app -n tanzu-system-monitoring   prometheus -oyaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
...
  template:
  - ytt:
      ignoreUnknownComments: true
      inline:
        pathsFrom:
        - secretRef:
            name: node-exporter-fix
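
Beyond checking the annotation and the App template, it can also help to confirm that the flag reached the rendered DaemonSet. The following is a minimal sketch under stated assumptions: the helper name softnet_collector_disabled is hypothetical, and the DaemonSet and container are assumed to be named prometheus-node-exporter in tanzu-system-monitoring, as in the overlay and examples in this article.

```shell
#!/bin/sh
# Flag that the overlay in this article appends to the node exporter args.
SOFTNET_FLAG="--no-collector.softnet"

# softnet_collector_disabled
# Reads the container args (any text form) on stdin.
# Returns 0 if the flag is present, non-zero otherwise.
softnet_collector_disabled() {
    # -F matches the flag literally; "--" stops grep from parsing it as an option.
    grep -qF -- "$SOFTNET_FLAG"
}

# Example usage against a cluster (not executed here):
#   kubectl get daemonset -n tanzu-system-monitoring prometheus-node-exporter \
#     -o jsonpath='{.spec.template.spec.containers[?(@.name=="prometheus-node-exporter")].args}' \
#     | softnet_collector_disabled && echo "softnet collector disabled"
```

If the check fails shortly after applying the overlay, the App may not have reconciled yet; the pause/unpause step forces a reconciliation.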

 

To revert the changes:

  1. Delete the Secret in the namespace where the packageinstall is deployed. In this example the namespace is tanzu-system-monitoring:
    kubectl delete secret -n tanzu-system-monitoring node-exporter-fix
  2. Remove the annotation from the packageinstall:
    kubectl annotate pkgi -n tanzu-system-monitoring prometheus ext.packaging.carvel.dev/ytt-paths-from-secret-name.0-
  3. Optionally, to force reconciliation, pause and unpause the application:
    kubectl patch app -n tanzu-system-monitoring prometheus --type merge -p '{"spec":{"paused":true}}'
    kubectl patch app -n tanzu-system-monitoring prometheus --type merge -p '{"spec":{"paused":false}}'

Additional Information