NTA LLMNR NBTNS Detector Fails in Environments with No Valid LLMNR Responses and Port Collisions

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

The LLMNR NBTNS detector fails and reports a failed status in environments where there are valid LLMNR requests, no valid LLMNR responses, and flows during the same day whose destination port matches the source port from which the LLMNR requests were sent.

The first and most obvious indication is that the LLMNR NBTNS detector's status changes to failed in the UI. Because a failed status can have a number of causes, check the pod logs to narrow it down:

# Log in to the NSX Manager as root, then list the llmnrnbtns pods

napp-k -n nsxi-platform get pods | grep llmnr

# View the logs of the llmnrnbtns pod returned by the previous command

napp-k -n nsxi-platform logs <llmnrnbtns pod>

The following log should be observed:

2024-11-20 08:55:05,786 - [MainThread] - common.utils.detector.core - ERROR - Failed to detect events on site <redacted site ID> due to 'srcVmId' is both an index level and a column label, which is ambiguous.
2024-11-20 08:55:05,791 - [MainThread] - common.utils.detector.core - ERROR - Traceback (most recent call last):
  File "/opt/vmware/nsx/intelligence/nta/detectors/common/utils/detector/core.py", line 116, in _run
    executor.submit(self._run_on, config).result(timeout=float(Config.DETECTOR_TIMEOUT_SEC))
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vmware/nsx/intelligence/nta/detectors/common/utils/detector/core.py", line 165, in _run_on
    baseline_was_updated = self._update_baseline(config.site_id)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vmware/nsx/intelligence/nta/detectors/common/utils/detector/core.py", line 227, in _update_baseline
    partial_baseline = self._generate_baseline(site_id, interval)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vmware/nsx/intelligence/nta/detectors/llmnrnbtns/detector/core.py", line 122, in _generate_baseline
    self._get_llmnr_partial_baseline_data(interval, site_id),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vmware/nsx/intelligence/nta/detectors/llmnrnbtns/detector/core.py", line 191, in _get_llmnr_partial_baseline_data
    response_result = response_result.groupby("srcVmId")[["count", "dstVmId"]].agg(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/frame.py", line 8402, in groupby
    return DataFrameGroupBy(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/groupby/groupby.py", line 965, in __init__
    grouper, exclusions, obj = get_grouper(
                               ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/groupby/grouper.py", line 878, in get_grouper
    obj._check_label_or_level_ambiguity(gpr, axis=axis)
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/generic.py", line 1797, in _check_label_or_level_ambiguity
    raise ValueError(msg)
ValueError: 'srcVmId' is both an index level and a column label, which is ambiguous.

This log confirms that the error has occurred.
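
When the pod log is long, the check can be scripted. The following is a minimal sketch, assuming the pod log has been saved to a text file; the function name `find_ambiguity_errors` and the file name in the usage note are hypothetical and not part of the product. It searches for the error signature shown in the excerpt above.

```python
import re

# Error signature copied from the log excerpt; the site ID varies, so it is
# matched with a wildcard.
ERROR_SIGNATURE = re.compile(
    r"ERROR - Failed to detect events on site .* due to "
    r"'srcVmId' is both an index level and a column label, which is ambiguous\."
)


def find_ambiguity_errors(log_text: str) -> list[str]:
    """Return every log line that matches the known error signature."""
    return [line for line in log_text.splitlines() if ERROR_SIGNATURE.search(line)]
```

For example, after redirecting the output of the logs command to a file (here assumed to be named `llmnrnbtns.log`), calling `find_ambiguity_errors(open("llmnrnbtns.log").read())` returns the matching lines, or an empty list if the signature is absent.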

Environment

NAPP 4.2.x

Cause

The issue occurs in environments where all of the following are true:

- The LLMNR NBTNS detector is activated
- There are valid LLMNR requests
- There is no valid LLMNR responder
- A VM generates unrelated flows whose destination port matches a source port used in the LLMNR requests. A port scanner is one possible source of such flows.
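
The ValueError in the traceback is a standard pandas error that is raised when a grouping key names both an index level and a column. The sketch below reproduces that error outside the product, assuming a small hand-built frame; the sample values and the helper name `groupby_error` are hypothetical, and how the detector's own data reaches this state is not described in this article.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the traceback.
frame = pd.DataFrame(
    {
        "srcVmId": ["vm-a", "vm-a", "vm-b"],
        "dstVmId": ["vm-x", "vm-y", "vm-x"],
        "count": [1, 2, 3],
    }
)

# Keeping the column while also using it as the index yields a frame in
# which "srcVmId" is both an index level and a column label.
ambiguous = frame.set_index("srcVmId", drop=False)


def groupby_error(df: pd.DataFrame) -> "str | None":
    """Run a groupby like the one in the traceback; return the error text, if any."""
    try:
        df.groupby("srcVmId")[["count", "dstVmId"]].agg(
            {"count": "sum", "dstVmId": "nunique"}
        )
    except ValueError as exc:
        return str(exc)
    return None
```

Calling `groupby_error(frame)` on the plain frame returns `None`, while `groupby_error(ambiguous)` returns the ambiguity message, which illustrates why the failure appears only for a particular shape of input data.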

Resolution

This issue is caused by four circumstances in the environment coming together to generate an edge case in the application logic. Changing any one of the four prevents the issue. For example:

- Ensure that a valid LLMNR responder is present when the LLMNR NBTNS detector is activated
- Stop port scans while there is no valid LLMNR responder, because the scans can generate flows that trigger the edge case and eventually crash the NTA detector

There is currently no other workaround available. The issue will be fixed in the product in a future release.