When NSX is configured at a large scale, with a large number of groups each containing a large number of IP addresses (ipsets), the nsx-config-0-0 pod can go out-of-memory during an nsx-config full sync.
SSP 5.0
The out-of-memory condition in the nsx-config-0-0 pod can occur due to the in-memory relationship building of IP addresses to groups. The supported scale for IP-based groups is 10,000; this issue can be observed if there are more IP-based groups than that, or if many groups have ipsets containing roughly 0-5,000 or more IP addresses each.
This issue can be confirmed based on the following observations:
1. Run the following command to check the status of the pod:
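A minimal example of such a command, assuming the pod runs in a namespace referred to here as <ssp-namespace> (substitute the actual namespace of your SSP deployment):

kubectl describe pod nsx-config-0-0 -n <ssp-namespace>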
In the output of the above command, find the nsx-config container and check Last State/Reason to confirm whether the restart was caused by an OOM condition. The Last State should be Terminated and the Reason should be OOMKilled.
2. Another check is to monitor the current memory consumption of the nsx-config-0-0 pod for 10 minutes and confirm whether it rises above 45000Mi.
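For a point-in-time reading, a command along the following lines can be used (the namespace placeholder is an assumption, and kubectl top requires the metrics server to be available on the cluster):

kubectl top pod nsx-config-0-0 -n <ssp-namespace> --containers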
Otherwise, run the following script to record the output to a file, and terminate the script after 10 minutes.
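A minimal sketch of such a script, assuming the same <ssp-namespace> placeholder and a 30-second sampling interval (20 samples over roughly 10 minutes):

# Sample nsx-config-0-0 memory usage every 30 seconds for ~10 minutes
for i in $(seq 1 20); do
  date >> nsx-config-0-0-memory.log
  kubectl top pod nsx-config-0-0 -n <ssp-namespace> --containers >> nsx-config-0-0-memory.log
  sleep 30
done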
3. Observe the nsx-config-0-0 logs using the following command and look for entries containing a large number of ipsets for multiple groups:
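One way to pull these log entries, assuming the container is named nsx-config and using the <ssp-namespace> placeholder:

kubectl logs nsx-config-0-0 -c nsx-config -n <ssp-namespace> | grep ip_set_contents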
...Sending to druid ManagerRealizationConfig { "revision" : 0, "tags" : [ ], "nsx_agent_seen_time" : 1741214631148, "site_id" : "3c582b44-9a82-4363-9e1a-92ce6b8f622a", "config_type" : "NS_GROUP", "timestamp" : "2025-03-05T22:45:08.923492328Z", "epoch" : 7, "mp_uuid" : "084838e8-06d0-4a4a-b9e9-1f404cab9e64", "policy_path_from_tag" : "/infra/domains/default/groups/ukgrp3_0", "display_name" : "ukgrp3_0", "create_user" : "admin", "create_time" : 1741180839520, "last_modified_user" : "admin", "last_modified_time" : 1741180839520, "deleted" : false, "deletion_time" : 0, "scope" : "LOCAL", "scopeTagPair" : [ ], "effective_and_related_compute_members" : [ ], "effective_segments" : [ ], "effective_segment_ports" : [ ], "membership_types" : [ "IPAddress" ], "ip_set_contents" : [ "X.X.X.11", ... , "X.X.X.99" ], <--- Large number of ipsets "system_owned" : false}
2. Check the current memory allocation for the nsx-config container in the nsx-config-0-0 pod; it should be 5000Mi for both the request and the limit. If it is already 7000Mi, the remediation has already been applied and the remaining steps may not help; in that case, contact support.
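The configured request and limit can be read with a command along these lines (the namespace placeholder is an assumption):

kubectl get pod nsx-config-0-0 -n <ssp-namespace> -o jsonpath='{.spec.containers[?(@.name=="nsx-config")].resources}'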
3. Increase the full sync timeout. Run the following command to update the nsx-config config map, then set the fullSyncTimeoutMills value to 3600000 as shown below.
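A sketch of this step, assuming the config map is named nsx-config, lives in the <ssp-namespace> placeholder namespace, and carries fullSyncTimeoutMills as a key in its data section (verify the actual config map name and key location before editing):

kubectl edit configmap nsx-config -n <ssp-namespace>

data:
  fullSyncTimeoutMills: "3600000"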
4. Increase the pod memory to 7000Mi using the following command:
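One way to do this, assuming the pod is managed by a StatefulSet named nsx-config in the <ssp-namespace> placeholder namespace (adjust the workload kind and name to match your deployment):

kubectl set resources statefulset nsx-config -n <ssp-namespace> -c nsx-config --requests=memory=7000Mi --limits=memory=7000Mi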
5. Check the current memory allocation for the nsx-config container in the nsx-config-0-0 pod again using the command from Step 2; it should now be 7000Mi for both the request and the limit.