This issue is caused by a kernel deadlock that can occur while reading /proc/net/softnet_stat. For more details, see https://github.com/vmware/photon/commit/ed28c67a054c2e70d4be2f2b6ba5870da712bb20
Reading /proc/net/softnet_stat from userspace while network packets are being received and processed on the same CPU can deadlock the entire system.
Photon OS 5 is affected when the kernel version is lower than 6.1.128-2.ph5. The linux RPM version 6.1.128-2.ph5 and later contains the fix.
If your appliance constantly monitors network activity, either by reading /proc/net/softnet_stat directly or through network tools, the system can be affected and can crash.
TKGm 2.5.x clusters deployed with Photon OS 5 where the kernel version is lower than 6.1.128-2.ph5
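To check whether a node runs an affected kernel, its version can be compared with the fixed version named above. The following is a minimal sketch, not an official tool: the function name kernel_is_fixed is hypothetical, it assumes a sort that supports -V (version sort), and it assumes the version string has the same form as in this article (for example 6.1.128-2.ph5). Strings with an extra flavor suffix would need that suffix removed first.

```shell
#!/bin/sh
# Fixed kernel version, as stated in this article.
FIXED="6.1.128-2.ph5"

# kernel_is_fixed VERSION (hypothetical helper)
# Succeeds when VERSION sorts at or after FIXED in version order,
# fails when VERSION is older than FIXED.
kernel_is_fixed() {
    # Version-sort both strings; the first line is the lower one.
    lowest=$(printf '%s\n%s\n' "$1" "$FIXED" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED" ]
}
```

On a node, it could be called as: kernel_is_fixed "$(uname -r)" && echo "kernel contains the fix" || echo "kernel is affected"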
One of the packages that can be deployed with Tanzu, the prometheus package, contains a node_exporter DaemonSet that scrapes the affected path by default.
This eventually leads to a crash of a random worker or master node, usually about one node per week.
Upgrade to 2.5.3 once it is available, where Photon OS 5 is shipped with the fixed kernel version.
To prevent the node exporter in the prometheus package from scraping the softnet stats, the option "--no-collector.softnet" can be added. Below are the steps to apply this configuration. First, create a file named overlay.yaml with the following content:
#@ load("@ytt:overlay", "overlay")
#@overlay/match by=overlay.subset({"kind":"DaemonSet", "metadata":{"name":"prometheus-node-exporter"}}),expects=1
---
spec:
  template:
    spec:
      containers:
      #@overlay/match by="name"
      - name: prometheus-node-exporter
        args:
        #@overlay/append
        - --no-collector.softnet
kubectl create secret generic node-exporter-fix -n tanzu-system-monitoring -o yaml --dry-run=client --from-file=overlay.yaml | kubectl apply -f -
kubectl annotate pkgi -n tanzu-system-monitoring prometheus ext.packaging.carvel.dev/ytt-paths-from-secret-name.0=node-exporter-fix
kubectl patch app -n tanzu-system-monitoring prometheus --type merge -p '{"spec":{"paused":true}}'
kubectl patch app -n tanzu-system-monitoring prometheus --type merge -p '{"spec":{"paused":false}}'
Verification:
kubectl get pkgi -n tanzu-system-monitoring prometheus -oyaml
apiVersion: packaging.carvel.dev/v1alpha1
kind: PackageInstall
metadata:
  annotations:
    ext.packaging.carvel.dev/ytt-paths-from-secret-name.0: node-exporter-fix
kubectl get app -n tanzu-system-monitoring prometheus -oyaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
...
  template:
  - ytt:
      ignoreUnknownComments: true
      inline:
        pathsFrom:
        - secretRef:
            name: node-exporter-fix
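As an additional check, the rendered DaemonSet can be inspected to confirm that the flag reached the container arguments. This is a sketch under assumptions: the helper name args_disable_softnet is hypothetical, and the DaemonSet is assumed to be named prometheus-node-exporter (as in the overlay) and to run in the tanzu-system-monitoring namespace.

```shell
#!/bin/sh
# args_disable_softnet (hypothetical helper)
# Reads container args on stdin and succeeds only if the
# --no-collector.softnet flag is present as a literal string.
args_disable_softnet() {
    grep -qF -e '--no-collector.softnet'
}

# Example use against a cluster (namespace is an assumption):
# kubectl get daemonset prometheus-node-exporter -n tanzu-system-monitoring \
#   -o jsonpath='{.spec.template.spec.containers[*].args}' \
#   | args_disable_softnet && echo "softnet collector disabled"
```

If the check fails after the app has reconciled, the annotation and secret shown above are the first things to re-check.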
To revert the changes:
kubectl delete secret -n tanzu-system-monitoring node-exporter-fix
kubectl annotate pkgi -n tanzu-system-monitoring prometheus ext.packaging.carvel.dev/ytt-paths-from-secret-name.0-
kubectl patch app -n tanzu-system-monitoring prometheus --type merge -p '{"spec":{"paused":true}}'
kubectl patch app -n tanzu-system-monitoring prometheus --type merge -p '{"spec":{"paused":false}}'