Alarm with Feature "NSX Application Config Agent" and Event Type "Config Agent Unhealthy" is seen on NSX UI

Article ID: 373834


Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

If an Alarm with the feature "NSX Application Config Agent" and Event Type "Config Agent Unhealthy" is seen on the NSX UI, it might have been triggered due to a connection issue in the channel between NSX Config Agent and the Security Services Platform (SSP) or NSX Application Platform.

Impact: This results in stale or missed configuration updates on SSP, which can cause SSP verticles to malfunction.
For example, Metrics API calls made with the intent path may return an error response for certain queries. Sample error response:
{"error_code":950010,"module_name":"Nsx-Metrics","error_message":"API Validation error : Resource type <resource_type> does not match resource id(s) : [resource_ids]"}

 

Environment

NSX 4.2.1 and above

 

Cause

The alarm can be raised for one of the following reasons. The "View Runtime Details" field of the alarm shows the exact reason why the alarm was raised.

(1) Connectivity issue between NSX and Security Services Platform (SSP) or NSX Application Platform.
(2) Connectivity issue with the database.
(3) Failed to sync configuration data from NSX to Security Services Platform (SSP) or NSX Application Platform.

Note: The system tries to auto-remediate in the backend every 5 minutes by attempting to send all configuration data again to SSP. If auto-remediation resolves the issue, the alarm is automatically marked as "Resolved" on the NSX UI.

If the alarm is still not resolved after 5-7 minutes, the steps in the Resolution section will help in resolving it.

Resolution

The system tries to auto-remediate in the backend every 5 minutes by attempting to send all configuration data again to Security Services Platform (SSP) or NSX Application Platform. If auto-remediation resolves the issue, the alarm is automatically marked as "Resolved" on the NSX UI.

If the alarm is still not resolved after 5-7 minutes, follow the steps below that correspond to the reason reported for the alarm.

 

(1)  Connectivity issue between NSX and Security Services Platform (SSP) or NSX Application Platform. →
When NSX sends configuration data to the Security Services Platform (SSP) or NSX Application Platform, a retry mechanism absorbs intermittent connectivity issues. If the connectivity issue is prolonged and all the retries are exhausted, an alarm of this type is raised.

This connectivity failure can have various causes, such as a Kafka certificate update, SSL certificate issues between NSX and the communication channel (Kafka), or the communication channel being down completely.

To confirm the root cause of the alarm, open the command-line interface of the affected NSX Manager and run the following commands. To identify the affected NSX Manager, expand the alarm and check 'Reported by Node'.

cd /var/log/proton

zgrep "Exception happened when receiving instruction from NSX Intelligence" -A 1 nsxapi.*

 

The above command should show many occurrences of the error below:

java.lang.IllegalStateException: This consumer has already been closed.
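The check above can also be reduced to a single number. The following is a minimal sketch, not an official tool: the function name count_closed_consumer_errors is hypothetical, and it assumes the same log location and file pattern (/var/log/proton/nsxapi.*) as the commands above.

```shell
# Sketch: count "consumer has already been closed" errors that directly follow
# the NSX Intelligence instruction exception in the proton API logs.
# count_closed_consumer_errors is a hypothetical helper name.
count_closed_consumer_errors() {
  # Log directory; defaults to the location used in this article.
  local log_dir="${1:-/var/log/proton}"
  # zgrep reads plain and gzip-compressed logs; -h drops file name prefixes
  # so the second grep only sees log text. -A 1 keeps the line after each match.
  zgrep -h -A 1 "Exception happened when receiving instruction from NSX Intelligence" \
    "$log_dir"/nsxapi.* 2>/dev/null \
    | grep -c "This consumer has already been closed"
}
```

Running count_closed_consumer_errors with no argument on the affected NSX Manager prints the number of matching errors. A count of 0 suggests that this particular Kafka consumer failure is not the cause.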

To resolve this issue, follow the steps below.

- for Security Services Platform (SSP)
(a) Log in to any SSPI as the root user and restart the kafka-controller PODs as below,

k -n nsxi-platform rollout restart sts kafka-controller

(b) Check whether all the kafka-controller PODs are up on k8s by running the command below,

k -n nsxi-platform get pods | grep kafka-controller


(c) If all the kafka-controller PODs are up, check the kafka-controller logs for errors and resolve them. The kafka-controller logs can be viewed using the commands below -

k -n nsxi-platform logs kafka-controller-0 -f
k -n nsxi-platform logs kafka-controller-1 -f
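The pod check in step (b) can be narrowed to only the pods that need attention. The following is a minimal sketch: pods_not_ready is a hypothetical helper, and it assumes the default `get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Sketch: print the names of pods that are not fully ready.
# Reads default "get pods" output on stdin; $1 is a pod name prefix.
# pods_not_ready is a hypothetical helper name.
pods_not_ready() {
  local prefix="$1"
  awk -v prefix="$prefix" '
    # Only consider pods whose name starts with the given prefix.
    index($1, prefix) == 1 {
      # READY looks like "1/1": ready containers / total containers.
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}
```

For example, `k -n nsxi-platform get pods | pods_not_ready kafka-controller` prints nothing when every kafka-controller pod is Running with all containers ready. The same filter can be applied to the NSX Application Platform output with the prefix kafka.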


- for NSX Application Platform
(a) Restart the proton service on the affected NSX Manager by running the command below on its command-line interface,

systemctl restart proton

(b) After the proton service restarts successfully, log in as the root user to the NSX CLI and check whether all the Kafka PODs are up on k8s by running the command below -

napp-k get pods | grep kafka


Since Kafka is a statefulset, there should be PODs named kafka-0, kafka-1 etc. 

 

(c) If all the kafka PODs are up, check the kafka logs for errors and resolve them. The kafka logs can be viewed using the commands below -

napp-k logs kafka-0 -f
napp-k logs kafka-1 -f

 

After issues with the connectivity channel are resolved, this alarm should automatically be resolved on the UI.
If you face any further issues, please contact the ANS Broadcom Support Team to get a proper resolution.

 

(2) Connectivity issue with database tables →
When this alarm is Open on the NSX UI, any newly created configuration data like Groups, Rules, Policies, etc. might not be sent from NSX to the Security Services Platform (SSP) or NSX Application Platform.

This issue can be seen when Corfu is down. To check whether Corfu is down, follow the steps below -

(a) Log in as the admin user to the NSX CLI of the primary NSX Manager and run the command below to verify the status of the datastore.

get cluster status

(b) If the status of "Datastore" is DOWN, log in as the root user to the primary NSX Manager and run the command below to restart Corfu.

service corfu-server restart

(c) After 5-7 minutes, check the cluster status again to confirm that the status of "Datastore" is UP, using the command below -

get cluster status

After Corfu is up, the alarm should be resolved automatically on the NSX UI. If the alarm is not resolved, continue with the next step.

(d) If the issue is not yet resolved but "Datastore" is UP, log in to the NSX CLI of the primary NSX Manager as the admin user and restart proton using the command below.

service proton restart


If the alarm is still not resolved after proton is restarted, please contact the ANS Broadcom Support Team to get a proper resolution on this alarm.
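The wait-and-recheck pattern in step (c) can be sketched as a small polling helper. This is an illustration only: retry_until is a hypothetical name, and the check it runs must be supplied by the reader, because the exact output format of `get cluster status` is not reproduced in this article.

```shell
# Sketch: run a command repeatedly until it succeeds or attempts run out.
# Usage: retry_until <attempts> <interval_seconds> <command> [args...]
# retry_until is a hypothetical helper name.
retry_until() {
  local attempts="$1" interval="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the check command reports success.
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `retry_until 7 60 datastore_is_up` would run a reader-defined datastore_is_up check (a hypothetical function) once a minute, which covers the 5-7 minute window, and return non-zero if the check never succeeds.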

 

(3) Failed to sync configuration data from NSX to Security Services Platform (SSP) or NSX Application Platform → 
There can be multiple reasons for this sync failure, including a corner case where the object to be streamed has not been properly realized yet, or where processing it ran into an issue such as late realization of the object on the backend.

- for Security Services Platform (SSP)
To resolve this issue, log in to any SSPI as the root user and restart the nsx-config pods as below,

k -n nsxi-platform rollout restart sts nsx-config-0

k -n nsxi-platform rollout restart sts nsx-config-1

- for NSX Application Platform
To resolve this issue, log in to any NSX Manager as the root user and restart the nsx-config pod as below,

napp-k rollout restart sts nsx-config


After issues with the connectivity channel are resolved, this alarm should automatically be resolved on the UI.

Verification
To verify the resolution, check whether new configurations are reflected on the Intelligence UI.

Alternatively, look for recent FULL_SYNC_START logs from nsx-config using the commands below,

- For Security Services Platform (SSP)
Log in to the SSPI CLI and run the commands below,

k logs -n nsxi-platform nsx-config-0-0 | grep FULL_SYNC_START

k logs -n nsxi-platform nsx-config-1-0 | grep FULL_SYNC_START

- For NSX Application Platform
Log in to the NSX Manager CLI and run the command below,

napp-k logs nsx-config-0 | grep FULL_SYNC_START
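When the logs contain many sync entries, only the most recent one matters for this verification. The following is a minimal sketch: last_full_sync is a hypothetical helper that makes no assumption about the log line format beyond the FULL_SYNC_START marker named above.

```shell
# Sketch: print only the most recent FULL_SYNC_START line from logs on stdin.
# Prints nothing if no full sync has been logged.
# last_full_sync is a hypothetical helper name.
last_full_sync() {
  grep "FULL_SYNC_START" | tail -n 1
}
```

For example, `k logs -n nsxi-platform nsx-config-0-0 | last_full_sync` on SSP, or `napp-k logs nsx-config-0 | last_full_sync` on NSX Application Platform. If the printed line has a timestamp later than the restart, a full sync has started since the remediation.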

Additional Information