Feature "NSX Application Config Agent" and Event Type "Config Agent Unhealthy" Alarm

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

If an alarm with Feature "NSX Application Config Agent" and Event Type "Config Agent Unhealthy" is raised, the configuration data streamed to NSX Application Platform (NAPP) may be affected.

Environment

This affects NSX 4.2.1 and later, irrespective of the NSX Application Platform (NAPP) version.

Cause

The alarm can be raised for one of the following reasons:

(1) Connectivity issue between NSX and NSX Application Platform.
(2) Connectivity issue with the database.
(3) Failed to sync configuration data from NSX to NSX Application Platform.

Note: The backend attempts auto-remediation every 5 minutes by resending all configuration data to NSX Application Platform. If auto-remediation resolves the issue, the alarm is automatically marked as "Resolved" in the NSX UI.

If the alarm is still not resolved after 5-7 minutes, the steps below help resolve it.

Resolution

An alarm with Feature "NSX Application Config Agent" and Event Type "Config Agent Unhealthy" can be raised for three reasons. The alarm has a "View Runtime Details" field that states the exact reason the alarm was raised.

These reasons are:

(1) Connectivity issue between NSX and NSX Application Platform.
(2) Connectivity issue with the database.
(3) Failed to sync configuration data from NSX to NSX Application Platform.

Follow the steps below that correspond to the reported reason.

(1) Connectivity issue between NSX and NSX Application Platform →

While configuration data is sent from NSX to NSX Application Platform, a retry mechanism tolerates intermittent connectivity issues. However, if the connectivity issue is prolonged and all retries are exhausted, an alarm of this type is raised.

This connectivity failure can have various causes, including a Kafka certificate update, SSL certificate issues between NSX and the communication channel (Kafka), or the communication channel being down completely.

To resolve this issue, follow the steps below:

(a) Check whether the Kafka connectivity between the NSX Config Agent and NAPP/SSP is broken and unable to recover from a bad state.

To confirm the alarm's root cause, access the command-line interface of the affected NSX Manager (to identify it, expand the alarm and check 'Reported by Node') and execute the following commands.

cd /var/log/proton

zgrep "Exception happened when receiving instruction from NSX Intelligence" -A 1 nsxapi.*

The above command should show many occurrences of the error below:

java.lang.IllegalStateException: This consumer has already been closed.

If you see the above error, follow this KB:

https://knowledge.broadcom.com/external/article?articleNumber=387918
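The log check in step (a) can be wrapped in a small shell function that counts how often the closed-consumer error follows the exception line. This is a minimal sketch, not an official tool: the function name count_closed_consumer_errors and its optional directory argument are hypothetical, while the search strings and the /var/log/proton path come from the step above.

```shell
# Hypothetical helper: count "consumer closed" errors that follow the
# instruction-receive exception in the nsxapi logs.
# $1 is an optional log directory (defaults to /var/log/proton, as in step (a)).
count_closed_consumer_errors() (
    log_dir="${1:-/var/log/proton}"
    cd "$log_dir" || exit 1
    # zgrep reads gzip-compressed logs; -h drops file names, -A 1 keeps the
    # line after each match, where the IllegalStateException is expected.
    zgrep -h -A 1 \
        "Exception happened when receiving instruction from NSX Intelligence" \
        nsxapi.* 2>/dev/null \
        | grep -c "This consumer has already been closed" || true
)
```

Running count_closed_consumer_errors with no argument on the affected NSX Manager prints a single number. A large count is consistent with the condition described in the linked KB; a count of 0 suggests a different root cause.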

(b) Log in to the NSX CLI as the root user.

Check whether all the Kafka pods are up on Kubernetes by running the command below:

napp-k get pods | grep kafka

Since Kafka is a StatefulSet, there should be pods named kafka-0, kafka-1, and so on.
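To spot unhealthy Kafka pods quickly, the pod listing can be filtered as sketched below. This assumes that napp-k prints the standard kubectl column layout (NAME, READY, STATUS, RESTARTS, AGE), which this article does not confirm; the function name list_unhealthy_kafka_pods is hypothetical.

```shell
# Hypothetical helper: read a pod listing on stdin and print every kafka-N pod
# that is not Running or whose READY count is incomplete.
# Assumed columns: NAME READY STATUS RESTARTS AGE (standard kubectl layout).
list_unhealthy_kafka_pods() {
    awk '$1 ~ /^kafka-[0-9]+$/ {
        split($2, ready, "/")            # READY looks like "1/1"
        if ($3 != "Running" || ready[1] != ready[2])
            print $1, $2, $3             # name, ready count, status
    }'
}

# Usage on the NSX Manager root shell:
#   napp-k get pods | list_unhealthy_kafka_pods
```

Empty output means every listed Kafka pod reports Running with all containers ready. Any pod that is printed is a candidate for the log inspection in the next step.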

(c) If all the Kafka pods are up, check the Kafka logs for errors and resolve them. The Kafka logs can be viewed using the command below:

napp-k logs kafka-1 -f

If the Kafka pods are not up, check them for errors such as network or SSL issues, and raise the issue with the ANS Broadcom Support Team for proper resolution.

After the issues with the connectivity channel are resolved, this alarm should be resolved automatically in the UI.

(2) Connectivity issue with database tables →

While this alarm is open in the NSX UI, newly created configuration data such as Groups, Rules, and Policies might not be sent from NSX to NSX Application Platform.

This issue can occur when Corfu is down. To check whether Corfu is down, follow the steps below:

(a) Log in to the NSX CLI of the primary NSX Manager as the admin user.
Run the command below to verify the status of the Datastore.

get cluster status

(b) If the status of "Datastore" is DOWN, log in to the primary NSX Manager as the root user and run the command below to restart Corfu.

service corfu-server restart

(c) After 5-7 minutes, check the cluster status again to confirm that the status of "Datastore" is UP, using the command below:

get cluster status

After Corfu is up, the alarm should be resolved automatically in the NSX UI. If the alarm is not resolved, continue with the next step.

(d) If the issue is not yet resolved but "Datastore" is UP, log in to the NSX CLI of the primary NSX Manager as the admin user and restart proton using the command below:

service proton restart

If the alarm is still not resolved after proton is restarted, contact the ANS Broadcom Support Team for proper resolution of this alarm.

After the issues with the subscription to database tables are resolved, this alarm should be resolved automatically in the UI.

(3) Failed to sync configuration data from NSX to NSX Application Platform →

This sync failure can have multiple causes, including a corner case where the object to be streamed has not yet been properly realised, or where an issue occurred while processing it, such as late realisation of the object on the backend.

To resolve this issue, log in to any NSX Manager as the root user and restart the nsx-config pod as shown below:

napp-k rollout restart sts nsx-config

After the sync issue is resolved, this alarm should be resolved automatically in the UI.
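The nsx-config restart in (3) can be combined with a wait for the rollout to finish, as in the sketch below. It assumes that napp-k forwards its arguments to kubectl unchanged, so that the standard kubectl "rollout status" subcommand and its --timeout flag are available; this article does not confirm that. The function name restart_nsx_config, its optional command argument, and the 300-second timeout are hypothetical.

```shell
# Hypothetical helper: restart the nsx-config StatefulSet and wait for it.
# $1 is the kubectl-style command to use (defaults to napp-k, as in this article).
restart_nsx_config() {
    kubectl_cmd="${1:-napp-k}"
    # Restart command taken from the step above.
    "$kubectl_cmd" rollout restart sts nsx-config || return 1
    # Assumption: napp-k accepts the standard kubectl rollout status subcommand.
    # The 300s timeout is an arbitrary illustrative value.
    "$kubectl_cmd" rollout status sts nsx-config --timeout=300s
}
```

A non-zero exit status means either the restart was rejected or the pods did not become ready within the timeout. In that case, inspect the nsx-config pod before waiting for the alarm to clear.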