During NAPP upgrade to 4.2, all Druid pods will be restarted in the following order: Druid Historical -> Druid Middle Manager -> Druid Coordinator -> Druid Broker -> Druid Router.
On the upgrade UI, the Druid pods show "Upgrade Status Failed".
On the NSX Manager, running napp-k get pods | grep druid shows multiple Druid pods in CrashLoopBackOff.
Identify the first pods to crash according to the restart order above (Druid Historical -> Druid Middle Manager -> Druid Coordinator -> Druid Broker -> Druid Router); these can be the Historical pods or the Middle Manager pods. Print their logs using napp-k logs <pod name>, and check whether the following messages appear:
2024-08-19T16:26:09,796 WARN [main] oshi.software.os.linux.LinuxOperatingSystem - Did not find udev library in operating system. Some features may not work.
2024-08-19T16:26:09,796 WARN [main] oshi.software.os.linux.LinuxOperatingSystem - Did not find udev library in operating system. Some features may not work.
Exception in thread "main" java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors:
1) Error in custom provider, java.lang.NoClassDefFoundError: Could not initialize class com.sun.jna.Native
at org.apache.druid.server.metrics.MetricsModule.getOshiSysMonitor(MetricsModule.java:203) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.server.metrics.MetricsModule)
at org.apache.druid.server.metrics.MetricsModule.getOshiSysMonitor(MetricsModule.java:203) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.server.metrics.MetricsModule)
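The check described above can be scripted. The following is a minimal sketch, not part of the product: the helper names crashing_druid_pods and has_jna_signature are hypothetical, and it assumes that the Druid pod names begin with druid-historical, druid-middle-manager, druid-coordinator, druid-broker and druid-router, and that napp-k get pods prints the usual NAME, READY, STATUS columns. Verify both assumptions against your environment before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not shipped with NAPP.

# Restart order from this article. ASSUMPTION: pod names start with these prefixes.
DRUID_ORDER="druid-historical druid-middle-manager druid-coordinator druid-broker druid-router"

# Reads pod listing text (NAME READY STATUS ... columns) on stdin and prints
# the names of Druid pods in CrashLoopBackOff, grouped in restart order.
crashing_druid_pods() {
  local listing component
  listing=$(cat)
  for component in $DRUID_ORDER; do
    printf '%s\n' "$listing" | awk -v c="$component" \
      'index($1, c) == 1 && $3 == "CrashLoopBackOff" { print $1 }'
  done
}

# Reads pod log text on stdin; returns 0 if the JNA error from this article is present.
has_jna_signature() {
  grep -qF 'Could not initialize class com.sun.jna.Native'
}
```

With these functions loaded in a shell on the NSX Manager, napp-k get pods | crashing_druid_pods would list the crashing Druid pods with the earliest in the restart order first, and napp-k logs <pod name> | has_jna_signature && echo "signature found" would report whether a given pod's log contains the error shown above.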
This issue affects only upgrades to NAPP 4.2, regardless of the version being upgraded from.
The Druid Historical and Druid Middle Manager pods use PVCs as temporary storage, and these volumes may fill up before or during the upgrade. If the pods restart while the disks are full, the OSHI process (responsible for collecting metrics such as CPU and memory) cannot extract its libraries to the disk and fails to start, causing the pods to crash. While the pods are crashing, disk cleanup cannot complete, so the pods remain in CrashLoopBackOff.
This will be fixed in a future version. Until then, the full PVC volume must be deleted to remove the temporary files, using the process below: