Upgrade to 4.2 fails because Druid pods are in CrashLoopBackOff as a result of full disks.

Article ID: 375495

Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware NSX Networking

Issue/Introduction

During NAPP upgrade to 4.2, all Druid pods will be restarted in the following order: Druid Historical -> Druid Middle Manager -> Druid Coordinator -> Druid Broker -> Druid Router. 

On the upgrade UI, the Druid pods show "Upgrade Status Failed".
On the NSX Manager, running napp-k get pods | grep druid shows multiple pods in CrashLoopBackOff.
Identify the first pods that crashed, based on the restart order "Druid Historical -> Druid Middle Manager -> Druid Coordinator -> Druid Broker -> Druid Router". These will be Historical or Middle Manager pods. Print their logs using napp-k logs <pod name>, and check whether the following messages appear:

2024-08-19T16:26:09,796 WARN [main] oshi.software.os.linux.LinuxOperatingSystem - Did not find udev library in operating system. Some features may not work.
Exception in thread "main" java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors:

1) Error in custom provider, java.lang.NoClassDefFoundError: Could not initialize class com.sun.jna.Native
  at org.apache.druid.server.metrics.MetricsModule.getOshiSysMonitor(MetricsModule.java:203) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.server.metrics.MetricsModule)
  at org.apache.druid.server.metrics.MetricsModule.getOshiSysMonitor(MetricsModule.java:203) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.server.metrics.MetricsModule)
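
The log check described above can be scripted when several pods are crashing. The following is a minimal sketch, not an official tool. It assumes that napp-k behaves like kubectl scoped to the NAPP namespace and is callable from the current shell; if napp-k is a shell alias on your system, paste the functions into an interactive session rather than running them as a script. The function names are hypothetical.

```shell
# Error text taken from the log excerpt in this article.
JNA_ERROR='Could not initialize class com.sun.jna.Native'

# Print the names of Druid pods currently in CrashLoopBackOff, one per line.
# Assumes the default "get pods" table output (NAME in the first column).
crashing_druid_pods() {
  napp-k get pods | awk '/druid/ && /CrashLoopBackOff/ {print $1}'
}

# Print only those crashing Druid pods whose logs contain the JNA error.
pods_with_jna_error() {
  for pod in $(crashing_druid_pods); do
    if napp-k logs "$pod" 2>/dev/null | grep -qF "$JNA_ERROR"; then
      echo "$pod"
    fi
  done
}
```

Calling pods_with_jna_error lists the affected pods. Per the restart order, Historical or Middle Manager pods in that list are the ones to address first.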

Environment

This affects any upgrade to NAPP 4.2, regardless of the version being upgraded from.

Cause

The Druid Historical and Druid Middle Manager pods use PVCs as temporary storage, and these volumes may fill up before or during the upgrade. If the pods restart while the disks are full, OSHI (responsible for collecting metrics such as CPU and memory) cannot extract its libraries to the disk and cannot run, which causes the pods to crash. While the pods are crashing, disk cleanup cannot complete, so the pods remain in CrashLoopBackOff.

Resolution

This will be fixed in a future version. Until then, the full PVC must be deleted to remove the temporary files, using the process below:

  1. Find the pods whose logs show the error "Could not initialize class com.sun.jna.Native", for example "druid-middle-manager-0".
  2. Run "napp-k get pods <pod name> -o jsonpath='{.spec.volumes[].persistentVolumeClaim.claimName}'" and note the PVC name for use in the next command.
  3. Run napp-k delete pvc <pvc name> to delete the PVC. The command may hang, which is expected, and you may exit with "Ctrl+C".
  4. Run napp-k get pvc to verify that the deleted PVC is in the "Terminating" state.
  5. Run napp-k delete pod <pod name>, using the pod from step 1, to restart the failing pod.
  6. After the restart, the pod should come up in the "Running" state. Other Druid pods in CrashLoopBackOff will also come up subsequently. You can verify the status of these pods by running napp-k get pods | grep druid.
  7. Once the Druid pods are all running, you can retry the 4.2 upgrade.
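
Steps 2 through 5 can be combined into one helper for a single pod. This is a sketch under stated assumptions, not a supported procedure. It assumes napp-k is callable from the current shell and forwards its arguments to kubectl; in particular, --wait=false is a kubectl delete option used here in place of pressing Ctrl+C, and its pass-through by napp-k is an assumption. The function name is hypothetical.

```shell
# Delete the PVC of one failing Druid pod, then restart the pod.
# Usage: reset_druid_pod_storage <pod name>
reset_druid_pod_storage() {
  pod_name=$1

  # Step 2: look up the PVC claimed by the pod (jsonpath as given in step 2).
  pvc_name=$(napp-k get pods "$pod_name" \
    -o jsonpath='{.spec.volumes[].persistentVolumeClaim.claimName}')
  if [ -z "$pvc_name" ]; then
    echo "No PVC found for pod $pod_name" >&2
    return 1
  fi

  # Step 3: request deletion without waiting for it to finish
  # (assumption: napp-k passes --wait=false through to kubectl).
  napp-k delete pvc "$pvc_name" --wait=false

  # Step 4: show the PVC line so its "Terminating" state can be confirmed.
  napp-k get pvc | grep -F "$pvc_name"

  # Step 5: restart the failing pod.
  napp-k delete pod "$pod_name"
}
```

For example, reset_druid_pod_storage druid-middle-manager-0 would handle the pod named in step 1. Afterwards, napp-k get pods | grep druid confirms whether the pods have returned to "Running" before the upgrade is retried.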