Steps to manually collect heap dump of nsx-config container in support bundle

Article ID: 388450

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

When the nsx-config container is OOMKilled, a heap dump is not created.

Environment

SSP 5.0

Cause

A heap dump may not be created for any of the following reasons:

  • The container in the nsx-config pod reaches its memory limit before the Java process running inside it reaches its -Xmx limit. Kubernetes then kills the container before a heap dump is written.
  • If resource exhaustion (CPU, memory, or other system limits) prevents the container from responding to the liveness probe, Kubernetes restarts the container.
  • Even when the proper jvmOptions for heap dump creation on OutOfMemory are defined, the heap dump may not be triggered if Kubernetes terminates the container too quickly after the OOM event.
  • Generating a heap dump requires enough free space on the worker node's ephemeral disk. If the disk is full or nearly full, the JVM may fail to write the heap dump even when it is configured to do so. A heap dump is typically 1 to 1.5 times the defined -Xmx. Because the -Xmx values of the containers in pods nsx-config-0-0 and nsx-config-1-0 are 3750M and 1500M respectively, about 5.7GB and 2.3GB of free space are needed respectively.
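
The sizing rule in the last point can be checked before a dump is attempted. The following is a minimal sketch, assuming the 1.5x upper bound stated above and that it is run where the dump path is mounted (for example, inside the container). The function names are hypothetical and not part of the product.

```shell
# Sketch: estimate worst-case heap dump size and compare it with free disk space.
# Assumes the rule of thumb above: a dump is at most 1.5x the -Xmx value.

# required_dump_mb <xmx_mb>: print the worst-case dump size in MB.
required_dump_mb() {
  echo $(( $1 * 3 / 2 ))
}

# free_mb <path>: print free space in MB on the filesystem holding <path>.
free_mb() {
  df -Pk "$1" | awk 'NR==2 {print int($4 / 1024)}'
}

# has_room_for_dump <xmx_mb> <path>: succeed if <path> can hold a worst-case dump.
has_room_for_dump() {
  [ "$(free_mb "$2")" -ge "$(required_dump_mb "$1")" ]
}
```

For example, `required_dump_mb 3750` prints 5625 and `required_dump_mb 1500` prints 2250, which is in line with the figures quoted above. `has_room_for_dump 3750 /var/dump/pace` returns success only if that much space is free.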

Resolution

If a heap dump is not created automatically, follow the steps below to create one manually so that it is collected in the support bundle. Run these steps on the SSPI VM.

  • Example output when the nsx-config pods are healthy:
k -n nsxi-platform get pods | grep nsx-config
nsx-config-0-0                                                   2/2     Running     0               128m
nsx-config-1-0                                                   2/2     Running     0               127m
nsx-config-create-kafka-topic-6x729                              0/1     Completed   0               3h50m
nsx-config-postgres-ids-signature-purge-cronjob-28992960-p972d   0/1     Completed   0               164m
  • Example output when an nsx-config pod is unhealthy:
k -n nsxi-platform get pods | grep nsx-config
 
nsx-config-0-0                                                     1/2     CrashLoopBackOff   12 (14s ago)    6h27m
nsx-config-1-0                                                     2/2     Running            0               6h21m
nsx-config-create-kafka-topic-j6lmg                                0/1     Completed          0               6h27m
 
k -n nsxi-platform describe pod nsx-config-0-0
 
Ports: 8080/TCP, 8443/TCP
    Host Ports: 0/TCP, 0/TCP
    State: Running
      Started: Thu, 21 Nov 2024 17:28:42 +0000
    Last State: Terminated
      Reason: OOMKilled
      Exit Code: 137
      Started: Thu, 21 Nov 2024 17:26:50 +0000
      Finished: Thu, 21 Nov 2024 17:28:25 +0000
    Ready: True
    Restart Count: 2
    Limits:
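
The OOMKilled reason shown in the describe output above can also be read directly, which is convenient in scripts. This is a sketch only: the function names are hypothetical, and it assumes `kubectl` is available on the SSPI VM under that name (the article's `k` appears to be a shorthand for it).

```shell
# Sketch: query why the nsx-config container in a pod last terminated.

# last_termination_reason <pod>: print the last termination reason, if any.
last_termination_reason() {
  kubectl -n nsxi-platform get pod "$1" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="nsx-config")].lastState.terminated.reason}'
}

# was_oomkilled <pod>: succeed if the last termination reason is OOMKilled.
was_oomkilled() {
  [ "$(last_termination_reason "$1")" = "OOMKilled" ]
}
```

Usage: `was_oomkilled nsx-config-0-0 && echo "nsx-config-0-0 was OOMKilled"`.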
  • If the nsx-config-0-0 pod has a memory issue, run the steps below to collect a heap dump from the nsx-config container of the pod.

    > Exec into the nsx-config container of the nsx-config-0-0 pod.
    root@sspi:~# k exec -it -n nsxi-platform nsx-config-0-0 -c nsx-config -- /bin/bash
    groups: cannot find name for group ID 2000
    nsx-user@nsx-config-0-0:/$
    > Get the PID of the Java process running inside the container.
    nsx-user@nsx-config-0-0:/$ pgrep -f nsx-config
    7
    > Create the heap dump manually with the command below. Use the path "/var/dump/pace" so that the heap dump is collected in the support bundle.
    jmap -dump:live,file=/var/dump/pace/<heap_dump_file_name> <PID>
    > <heap_dump_file_name> is the name of the heap dump file to create. For example:
    nsx-user@nsx-config-0-0:/$ jmap -dump:live,file=/var/dump/pace/nsx-config-0.hprof 7
    Dumping heap to /var/dump/pace/nsx-config-0.hprof ...
    Heap dump file created [129548512 bytes in 0.868 secs]
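
The exec, pgrep, and jmap steps above can be combined into one non-interactive call from the SSPI VM. The following is a minimal sketch, not a supported tool: the function name is hypothetical, and it assumes that `kubectl` is available under that name and that the first PID returned by `pgrep -f nsx-config` is the Java process, as in the example above.

```shell
# Sketch: take a manual heap dump of the Java process in the nsx-config container.
NAMESPACE="nsxi-platform"
CONTAINER="nsx-config"
DUMP_DIR="/var/dump/pace"   # path the support bundle collects from, per the steps above

# take_heap_dump <pod> <file>: dump the heap of <pod> to DUMP_DIR/<file>.
take_heap_dump() {
  pod="$1"; file="$2"
  # Assumption: the first matching PID is the Java process.
  pid=$(kubectl exec -n "$NAMESPACE" "$pod" -c "$CONTAINER" -- pgrep -f nsx-config | head -n 1)
  [ -n "$pid" ] || { echo "no matching PID found in $pod" >&2; return 1; }
  kubectl exec -n "$NAMESPACE" "$pod" -c "$CONTAINER" -- \
    jmap "-dump:live,file=${DUMP_DIR}/${file}" "$pid"
}
```

Usage: `take_heap_dump nsx-config-0-0 nsx-config-0.hprof`. Because the container may be restarted at any time while it is under memory pressure, running both steps in one call shortens the window in which the dump can be interrupted.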
  • Next, when creating a support bundle from the UI, make sure the analytics and "Include core files and audit logs" options are selected. When the SSP support bundle is created, it contains the heap dump with the prefix core- added to the heap_dump_file_name given above. For example, core-nsx-config-0.hprof will be present in the support bundle.
  • If the UI is not accessible for any reason, use the steps below instead. Run them outside SSPI, from any Linux VM with connectivity to SSP.
    > API to collect the support bundle.
    curl -sku admin:<PASSWORD> -X POST 'https://<SSP FQDN>:443/ssp/cluster-api/support-bundle/collection?action=collect' -H "Content-Type: application/json" -d '{"dynamic_content_filters":["DATA_STORAGE","CONFIGURATION_DATABASE","METRICS","ANALYTICS","MESSAGING","PLATFORM_SERVICES"],"log_age_limit":7,"content_filters":["ALL"]}'

    Response:
    {
    "failed_nodes": null,
    "remaining_nodes": [
    {
     "node_display_name": "ssp",
     "node_id": "ed5af649-2ba2-484b-8d5e-284fa7a79559",
     "status": "running"
    }
    ],
    "remoteTaskID": "937",
    "request_properties": {},
    "status": "running",
    "success_nodes": null
    }
    > API to check the status of the support bundle. Repeat the call until the status is success ("status": "success").
    curl -sku admin:<PASSWORD> -X GET 'https://<SSP FQDN>:443/ssp/cluster-api/support-bundle/status/<remoteTaskID from collect API response>'

    Response:
    {
    "details": {
    "failed_nodes": [],
    "remaining_nodes": [
     {
      "node_display_name": "ssp",
      "node_id": "ed5af649-2ba2-484b-8d5e-284fa7a79559",
      "status": "running"
     }
    ],
    "success_nodes": []
    },
    "request_properties": {
    "content_filters": [
     "ALL"
    ],
    "dynamic_content_filters": [
     "DATA_STORAGE",
     "CONFIGURATION_DATABASE",
     "METRICS",
     "ANALYTICS",
     "MESSAGING",
     "PLATFORM_SERVICES"
    ],
    "log_age_limit": 7
    },
    "status": "running"    <-- status field to check
    }
    > API to download the support bundle.
    curl -sku admin:<PASSWORD> -X GET 'https://<SSP FQDN>:443/ssp/cluster-api/support-bundle/response/<remoteTaskID from collect API response>' --output ssp.tar.gz
    > Verify that it is downloaded.
    [root@lvn-dvm-10-70-180-45 ~]# ls -lrt ssp*
    -rw-r--r-- 1 root root 1493202767 Feb 14 18:05 ssp.tar.gz
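
The three API calls above (collect, check status, download) can be chained in a small script. This is a sketch under stated assumptions rather than a supported tool. The variable names SSP_FQDN, SSP_PASSWORD, and POLL_SECONDS and all function names are hypothetical. It assumes the top-level "status" key is the last "status" key in the response, as in the sample responses above, and it treats any status other than running or success as a failure. If `jq` is installed, `jq -r .status` is a more robust way to read the field.

```shell
# Sketch: collect, poll, and download an SSP support bundle over the REST API.
BASE_PATH="ssp/cluster-api/support-bundle"
COLLECT_BODY='{"dynamic_content_filters":["DATA_STORAGE","CONFIGURATION_DATABASE","METRICS","ANALYTICS","MESSAGING","PLATFORM_SERVICES"],"log_age_limit":7,"content_filters":["ALL"]}'

# Print the remoteTaskID value from a collect response read on stdin.
extract_task_id() {
  sed -n 's/.*"remoteTaskID"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Print the value of the last "status" key in a response read on stdin.
top_level_status() {
  grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' | tail -n 1 | sed 's/.*"\([^"]*\)"$/\1/'
}

# Start a collection, wait until it succeeds, then download ssp.tar.gz.
collect_bundle() {
  base="https://${SSP_FQDN}:443/${BASE_PATH}"
  task=$(curl -sku "admin:${SSP_PASSWORD}" -X POST "${base}/collection?action=collect" \
           -H "Content-Type: application/json" -d "${COLLECT_BODY}" | extract_task_id)
  [ -n "$task" ] || { echo "no remoteTaskID in response" >&2; return 1; }
  while :; do
    status=$(curl -sku "admin:${SSP_PASSWORD}" -X GET "${base}/status/${task}" | top_level_status)
    [ "$status" = "success" ] && break
    [ "$status" = "running" ] || { echo "unexpected status: ${status}" >&2; return 1; }
    sleep "${POLL_SECONDS:-30}"
  done
  curl -sku "admin:${SSP_PASSWORD}" -X GET "${base}/response/${task}" --output ssp.tar.gz
}
```

Usage: `SSP_FQDN=<SSP FQDN> SSP_PASSWORD=<PASSWORD> collect_bundle`, run from a Linux VM with connectivity to SSP.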

Additional Information

Note:

  • To make sure the process has enough memory to create the heap dump, edit the nsx-config stateful set and increase the memory limit by 1GB for the pod that has the issue.
    Example:  kubectl -n nsxi-platform edit sts nsx-config-0

    For the nsx-config-0 stateful set:

    resources:
      limits:
        memory: 6000Mi
  • After the heap dump has been collected in the support bundle, remove the heap dump files from the ephemeral storage:
    kubectl exec -it -n nsxi-platform nsx-config-0-0 -c nsx-config -- /bin/bash
    cd /var/dump/pace
    Remove all files in this directory.
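
The cleanup step can also be run without an interactive shell. The sketch below is deliberately narrower than the note above: it deletes only one named dump file and then lists what remains, so that any other files can be reviewed before they are removed. The function name is hypothetical, and `kubectl` is assumed to be available under that name.

```shell
# Sketch: remove a manually created heap dump from the container's dump directory.
NAMESPACE="nsxi-platform"
CONTAINER="nsx-config"
DUMP_DIR="/var/dump/pace"

# remove_heap_dump <pod> <file>: delete one named dump file, then list what remains.
remove_heap_dump() {
  pod="$1"; file="$2"
  # Reject empty names and anything containing a path separator.
  case "$file" in
    ""|*/*) echo "file name must be non-empty and contain no slash" >&2; return 1 ;;
  esac
  kubectl exec -n "$NAMESPACE" "$pod" -c "$CONTAINER" -- rm -f "${DUMP_DIR}/${file}" &&
  kubectl exec -n "$NAMESPACE" "$pod" -c "$CONTAINER" -- ls -l "$DUMP_DIR"
}
```

Usage: `remove_heap_dump nsx-config-0-0 nsx-config-0.hprof`.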