Resolving False MPS Analyst API Alarms in NSX Environments

Products

VMware NSX

Issue/Introduction

This article provides a workaround for false MPS Analyst API alarms in the NSX UI. These alarms are raised because of a data processing issue, even though connectivity to the Lastline Cloud API and data retrieval from it both succeed.

Symptoms:

In the NSX UI, the MPS Analyst API alarm is displayed as active even though MPS successfully connects to the Lastline Cloud API.
MPS is able to fetch the most recent verdicts from Lastline Cloud. The sa-scheduler-services pod on the NAPP platform logs the following two entries:

{"log":"2023-11-01T15:45:36.772141862Z stdout F \u001b[37m2023-11-01T15:45:36,771\u001b[m \u001b[32mINFO  \u001b[m[\u001b[1;34mscheduling-1\u001b[m] \u001b[1;33mc.v.n.s.a.s.AnalystSyncService\u001b[m: SECURITY [nsx@6876 comp=\"nsx-manager\" level=\"INFO\" subcomp=\"manager\"] Fetched 16387  from LastLine Cloud"

{"log":"2023-11-01T03:37:08.883551951Z stdout F \u001b[37m2023-11-01T03:37:08,883\u001b[m \u001b[1;31mERROR \u001b[m[\u001b[1;34mscheduling-1\u001b[m] \u001b[1;33mc.v.n.s.a.s.AnalystSyncDataScraper\u001b[m: SECURITY [nsx@6876 comp=\"nsx-manager\" errorCode=\"MP102251\" level=\"ERROR\" subcomp=\"manager\"] Exception occurred in AnalystSync Service, exception details : null","kubernetes":{"pod_name":"sa-scheduler-services-6abcdxyyz5-6abcd","namespace_name":"nsxi-platform","pod_id":"<UUID>","host":"napp-cluster-default-workers-abcdz-dabcd-labcl","container_name":"sa-scheduler-services","docker_id":"<UUID>","container_hash":"projects.registry.vmware.com/nsx_application_platform/clustering/sa-scheduler-services@sha256:XXXXXXXXXXYYYYYYYYYYZZZZZZZZZZ","container_image":"sha256:AAAAABBBBBBBBCCCCCCC"}}

The first entry confirms that MPS retrieved data from Lastline Cloud successfully. The second shows that an exception (error code MP102251) occurred while that data was being processed.
Affected versions: 4.x
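
To check whether a given log matches the pattern described above, the two signatures can be counted in a saved copy of the sa-scheduler-services log. The following is a minimal sketch, not an official tool: the function name summarize_analyst_sync_log and the log file path are hypothetical, and the search strings are taken from the sample entries in this article.

```shell
#!/bin/sh
# Hypothetical helper: count the two log signatures shown in this article.
# $1 is the path to a saved copy of the sa-scheduler-services log (assumed).
summarize_analyst_sync_log() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    # Successful fetches from Lastline Cloud (AnalystSyncService).
    fetched=$(grep -c 'Fetched .* from LastLine Cloud' "$log_file" || true)
    # Processing exceptions logged with error code MP102251.
    errors=$(grep -c 'MP102251' "$log_file" || true)
    echo "fetched=$fetched errors=$errors"
    # Both signatures present together matches the false-alarm symptom.
    if [ "$fetched" -gt 0 ] && [ "$errors" -gt 0 ]; then
        echo "pattern=match"
    else
        echo "pattern=no-match"
    fi
}
```

For example, summarize_analyst_sync_log /tmp/sa-scheduler.log (a hypothetical path) prints the two counts followed by pattern=match when both entries are present. A match is only an indicator of the symptom described here, not proof of the root cause.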

Environment

VMware NSX-T Data Center

Cause

The backend raises the Analyst API alarm on two conditions: a failure to connect to the Lastline Cloud Analyst API service, and a failure while processing the results and forwarding them to the ASDS and NDR components. Ideally, the alarm's scope should be limited to connectivity with the Lastline Cloud.

Resolution

A fix will be provided in a future release.

Workaround:
Delete the rescoring sync time entry from the Postgres database on the NAPP platform so that it is reset to the most recent time. This causes the backend to skip the problematic event. The sa-scheduler-service processes events at 5-minute intervals, so after the sync time is deleted the alarm should resolve automatically within the next 5 minutes, provided new events are processed successfully.
Steps:
1. Access the Postgres database
From the NSX Manager CLI, execute:

napp-k exec -it postgresql-ha-postgresql-0 -- /bin/bash

Fetch the Postgres password:

echo $POSTGRES_PASSWORD

Launch the psql CLI and enter the password when prompted:

psql -U postgres -h localhost

2. At the psql prompt, connect to the relevant database:

\c malwareprevention

3. Execute the deletion command:

DELETE FROM sa_configurations WHERE key = 'rescoring-sync-time';

This action resets the rescoring sync time and skips the problematic event.

Events are processed by the sa-scheduler-service every 5 minutes. The alarm should resolve automatically after successful processing of new events following the deletion.
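
It can be useful to inspect the row before and after the deletion. The following is a minimal sketch, assuming it is run from the shell inside the postgresql-ha-postgresql-0 pod opened in step 1. The function name show_rescoring_sync_time is hypothetical, and SELECT * is used because only the key column of sa_configurations is known from this article.

```shell
#!/bin/sh
# Sketch only: run inside the postgresql-ha-postgresql-0 pod (see step 1).
# show_rescoring_sync_time is a hypothetical helper name, not part of NSX.
show_rescoring_sync_time() {
    # -d selects the database; -c runs a single statement and exits.
    # psql prompts for the password fetched in step 1.
    psql -U postgres -h localhost -d malwareprevention \
        -c "SELECT * FROM sa_configurations WHERE key = 'rescoring-sync-time';"
}
```

Running show_rescoring_sync_time before the DELETE shows the current entry, and running it immediately afterwards should return no rows. Whether and when the entry reappears depends on the scheduler's next cycle, which this sketch does not verify.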

Note: This workaround is a temporary solution. Monitor the system for any anomalies after applying it, and apply the official fix once it is available in a future release.