The LLMNR NBTNS detector fails and reports a failed status in environments where there are valid LLMNR requests, no valid LLMNR responses, and flows during the same day whose destination port matches the source port from which the LLMNR requests were sent.
The LLMNR NBTNS detector will fail and report a failed status. The first and most obvious indication is that the detector's status changes to failed in the UI. A failed status can have a number of causes, however, so to confirm this particular issue the pod logs must be checked:
# access the nsx manager via root
# look for llmnrnbtns pods
The following log should be observed:
2024-11-20 08:55:05,786 - [MainThread] - common.utils.detector.core - ERROR - Failed to detect events on site <redacted site ID> due to 'srcVmId' is both an index level and a column label, which is ambiguous.
2024-11-20 08:55:05,791 - [MainThread] - common.utils.detector.core - ERROR - Traceback (most recent call last):
  File "/opt/vmware/nsx/intelligence/nta/detectors/common/utils/detector/core.py", line 116, in _run
    executor.submit(self._run_on, config).result(timeout=float(Config.DETECTOR_TIMEOUT_SEC))
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vmware/nsx/intelligence/nta/detectors/common/utils/detector/core.py", line 165, in _run_on
    baseline_was_updated = self._update_baseline(config.site_id)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vmware/nsx/intelligence/nta/detectors/common/utils/detector/core.py", line 227, in _update_baseline
    partial_baseline = self._generate_baseline(site_id, interval)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vmware/nsx/intelligence/nta/detectors/llmnrnbtns/detector/core.py", line 122, in _generate_baseline
    self._get_llmnr_partial_baseline_data(interval, site_id),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vmware/nsx/intelligence/nta/detectors/llmnrnbtns/detector/core.py", line 191, in _get_llmnr_partial_baseline_data
    response_result = response_result.groupby("srcVmId")[["count", "dstVmId"]].agg(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/frame.py", line 8402, in groupby
    return DataFrameGroupBy(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/groupby/groupby.py", line 965, in __init__
    grouper, exclusions, obj = get_grouper(
                               ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/groupby/grouper.py", line 878, in get_grouper
    obj._check_label_or_level_ambiguity(gpr, axis=axis)
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/generic.py", line 1797, in _check_label_or_level_ambiguity
    raise ValueError(msg)
ValueError: 'srcVmId' is both an index level and a column label, which is ambiguous.
This log confirms that the error has occurred.
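For readers unfamiliar with the final ValueError, the following is a minimal sketch of how pandas raises it. pandas refuses to group by a name that exists both as an index level and as a column. The sample data, the function names, and the use of set_index(drop=False) to create the ambiguous state are assumptions made for illustration only; the article does not state how the detector's DataFrame reaches that state, and the reset_index call shown is a generic pandas technique, not the product fix.

```python
import pandas as pd


def build_ambiguous_frame() -> pd.DataFrame:
    """Build a frame where 'srcVmId' is both the index name and a column (illustrative data)."""
    df = pd.DataFrame(
        {
            "srcVmId": ["vm-a", "vm-a", "vm-b"],
            "dstVmId": ["vm-x", "vm-y", "vm-x"],
            "count": [1, 2, 3],
        }
    )
    # drop=False keeps the column while also naming the index 'srcVmId'.
    return df.set_index("srcVmId", drop=False)


def reproduce_error() -> str:
    """Return the message pandas raises when grouping by the ambiguous name."""
    df = build_ambiguous_frame()
    try:
        df.groupby("srcVmId")[["count", "dstVmId"]].agg(
            {"count": "sum", "dstVmId": "nunique"}
        )
    except ValueError as exc:
        return str(exc)
    return ""


def group_without_ambiguity() -> pd.DataFrame:
    """Group the same data after discarding the duplicate index level."""
    df = build_ambiguous_frame().reset_index(drop=True)
    return df.groupby("srcVmId")[["count", "dstVmId"]].agg(
        {"count": "sum", "dstVmId": "nunique"}
    )


error_message = reproduce_error()
grouped = group_without_ambiguity()
```

Here error_message holds the same text as the last line of the traceback, while grouped is computed normally because the name 'srcVmId' refers only to a column.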
NAPP 4.2.x
In environments where:
- The LLMNR NBTNS detector is activated
- There are valid LLMNR requests
- There is no valid LLMNR responder
- A VM generates unrelated flows whose destination port overlaps with a source port used in LLMNR requests. A port scanner is one possible source of such flows.
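The port-overlap condition above can be sketched as a small check over flow records. This is only an illustration of the condition, not the detector's logic: the record layout (src_vm, src_port, dst_port) and the function name are hypothetical, and the only protocol fact relied on is that LLMNR requests are sent to UDP port 5355.

```python
from typing import Dict, Iterable, List

LLMNR_PORT = 5355  # UDP port used by LLMNR (RFC 4795)

Flow = Dict[str, object]  # hypothetical flow record: src_vm, src_port, dst_port


def find_overlapping_flows(flows: Iterable[Flow]) -> List[Flow]:
    """Return unrelated flows whose destination port equals an LLMNR request source port."""
    flows = list(flows)
    # Source ports from which LLMNR requests were sent.
    request_src_ports = {f["src_port"] for f in flows if f["dst_port"] == LLMNR_PORT}
    return [
        f
        for f in flows
        # Skip LLMNR traffic itself; keep only unrelated flows that hit one of those ports.
        if LLMNR_PORT not in (f["src_port"], f["dst_port"])
        and f["dst_port"] in request_src_ports
    ]


sample_flows = [
    {"src_vm": "vm-a", "src_port": 51000, "dst_port": 5355},        # LLMNR request
    {"src_vm": "vm-scanner", "src_port": 40000, "dst_port": 51000},  # e.g. a port scan
    {"src_vm": "vm-b", "src_port": 40001, "dst_port": 443},          # no overlap
]
overlaps = find_overlapping_flows(sample_flows)
```

In this sample only the flow from vm-scanner is returned, because its destination port 51000 is the source port of the LLMNR request from vm-a.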
This issue is caused by four circumstances in the environment coming together to generate an edge case in the application logic. By changing any one of the four, the issue will no longer be present. For example:
- Deploy a valid LLMNR responder if the LLMNR NBTNS detector is activated
- Stop port scans while there is no valid LLMNR responder, since the scans can generate flows that trigger the edge case and eventually crash the NTA detector
There is currently no other workaround available. The issue will be fixed in the product in a future release.