
After a successful ATP upgrade, Spark apps, including the rawflow, overflow, and context Spark apps, may be stuck in the Init state.


Article ID: 329222


Products

VMware NSX

Issue/Introduction

Symptoms:
  • Spark apps, including the rawflow, overflow, and context Spark apps, may be stuck in the Init state.
root@tb301-runner:~[603]# kubectl get pods -n nsxi-platform | grep -v Running | grep -v Completed
NAME                                                     READY   STATUS      RESTARTS      AGE
spark-app-context-driver                                 0/2     Init:0/3    0             22h
spark-app-overflow-driver                                0/2     Init:0/4    0             22h
  • Describing the pods shows a failure to mount a volume, e.g. spark-conf-volume-driver.
root@tb301-runner:~[599]# kubectl describe pods spark-app-context-driver -n nsxi-platform
Name:           spark-app-context-driver
Namespace:      nsxi-platform
Priority:       0
Node:           intelligencecluster-workers-r9dc5-7b87b98dff-f7h5n/30.30.0.55
Start Time:     Tue, 18 Jul 2023 00:19:00 -0700
Labels:         allow-traffic-to-dns=true
....TRUNCATED
Events:
  Type     Reason       Age                    From                                                         Message
  ----     ------       ----                   ----                                                         -------
  Warning  FailedMount  14m (x552 over 22h)    kubelet, intelligencecluster-workers-r9dc5-7b87b98dff-f7h5n  MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-b114368967dd9410-conf-map" not found
  Warning  FailedMount  4m31s (x685 over 21h)  kubelet, intelligencecluster-workers-r9dc5-7b87b98dff-f7h5n  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[spark-conf-volume-driver], unattached volumes=[kube-api-access-8hv5g spark-local-dir-1 driver-coredump processing-tls-cert-volume context-log4j-properties-volume context-well-known-user-sid-volume wait-for-secret-scripts context-override-properties-volume spark-conf-volume-driver]: timed out waiting for the condition


root@tb301-runner:~[607]# kubectl describe pods spark-app-overflow-driver -n nsxi-platform
Name:           spark-app-overflow-driver
Namespace:      nsxi-platform
Priority:       0
Node:           intelligencecluster-workers-r9dc5-7b87b98dff-f7h5n/30.30.0.55
Start Time:     Tue, 18 Jul 2023 00:19:10 -0700
Labels:         allow-traffic-to-dns=true
                allow-traffic-to-kubeapi=true
                app.kubernetes.io/instance=intelligence
....TRUNCATED
Events:
  Type     Reason       Age                    From                                                         Message
  ----     ------       ----                   ----                                                         -------
  Warning  FailedMount  14m (x560 over 22h)    kubelet, intelligencecluster-workers-r9dc5-7b87b98dff-f7h5n  MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-31e4058967ddae33-conf-map" not found
  Warning  FailedMount  4m55s (x696 over 22h)  kubelet, intelligencecluster-workers-r9dc5-7b87b98dff-f7h5n  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[spark-conf-volume-driver], unattached volumes=[spark-conf-volume-driver driver-coredump overflow-log4j-properties-volume spark-local-dir-1 kube-api-access-4vvpd processing-tls-cert-volume wait-for-secret-scripts scripts overflow-override-properties-volume]: timed out waiting for the condition
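
You can confirm the root cause by checking whether the configmap named in the FailedMount event exists (a sketch; the configmap name below is taken from the example events above, and the hash-suffixed name will differ in your environment):

kubectl get configmap spark-drv-b114368967dd9410-conf-map -n nsxi-platform

When the issue is present, kubectl reports the configmap as missing:
Error from server (NotFound): configmaps "spark-drv-b114368967dd9410-conf-map" not found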



Environment

VMware NSX 4.1.0

Resolution

Currently, no resolution is available for this issue.

Workaround:
If any of the Spark app driver pods show this error and are stuck in the Init state, delete them; Kubernetes will automatically restart them.
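
To identify the affected driver pods before deleting them, you can filter for pods stuck in the Init state (a sketch, reusing the filter from the Symptoms section; on the NSX Manager the equivalent would be napp-k get pods | grep Init):

kubectl get pods -n nsxi-platform | grep spark-app | grep Init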

Log in to the NSX Manager as the root user and then issue the applicable commands:

napp-k delete pod spark-app-context-driver
napp-k delete pod spark-app-overflow-driver
napp-k delete pod spark-app-rawflow-driver


Alternatively, if you are logged in to the Tanzu cluster, run:
kubectl -n nsxi-platform delete pod spark-app-context-driver
kubectl -n nsxi-platform delete pod spark-app-overflow-driver
kubectl -n nsxi-platform delete pod spark-app-rawflow-driver
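
After deleting the pods, you can verify that Kubernetes has recreated them and that they leave the Init state and reach Running (a sketch; the init containers may take a few minutes to complete):

kubectl get pods -n nsxi-platform | grep spark-app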


Additional Information

Versions where this is a known issue: 4.1.1

Impact/Risks:
Users will be unable to see blocked/external/cross-site flow records received by NSX Intelligence when the overflow driver is stuck in the Init state. No context data will be received when the context driver is stuck in the Init state. Users will not see any flows if the rawflow driver is stuck in the Init state.