Pace Agent Unhealthy — Config Full Sync Failure Due to Corfu StreamingExceptions

Article ID: 435813

Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

The Policy Pace Agent, within the NSX Manager, is responsible for streaming NSX configuration state to the Security Services Platform (SSP). This process encompasses two phases: an initial full sync (bulk replication of all current config objects) followed by delta streaming (incremental change propagation).

 

Environment

NSX, SSP 

Cause

When the NSX configuration is large or the infrastructure is slow, the full sync phase can take unusually long (more than 20 minutes). During such a prolonged full sync, the Corfu streaming session may time out and raise a StreamingException. This exception is unrecoverable within the current session, so the Pace Agent transitions to an unhealthy state and aborts the in-progress full sync. The full sync is restarted after some time, but it fails for the same reason, and the agent enters a loop of repeated full syncs.

Resolution

Prerequisites

Before applying any of the remediations below, identify the leader node for the Intelligence Agent service. All configuration API calls must be invoked on the leader node.

Identify the Intelligence Agent leader: 

Run the following command from the NSX Manager CLI:

 su admin -c "get cluster status verbose" | grep 'INTELLIGENCE_AGENT_SERVICE'

Example output:

 
INTELLIGENCE_AGENT_SERVICE - - b3870142-bc15-eb87-549c-XXXXXXXXXXXX 24484

The UUID field ( b3870142-bc15-eb87-549c-XXXXXXXXXXXX ) is the node ID of the leader. Resolve this to an IP address via your cluster node mapping and use that IP for all subsequent API calls.
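
If the command output is being collected by a script, the node ID can be extracted programmatically. The following is a minimal Python sketch, assuming the output line has the same whitespace-separated layout as the example above (service name, two placeholder columns, node UUID, trailing numeric field). The function name parse_leader_uuid is illustrative and not part of any NSX tooling; resolving the UUID to an IP address is still done through your cluster node mapping.

```python
def parse_leader_uuid(cli_output: str) -> str:
    """Return the node UUID from the INTELLIGENCE_AGENT_SERVICE line.

    Assumes the column layout shown in the example output:
    <service> <col> <col> <node-uuid> <number>
    """
    for line in cli_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "INTELLIGENCE_AGENT_SERVICE":
            if len(fields) < 4:
                raise ValueError(f"unexpected line format: {line!r}")
            return fields[3]
    raise ValueError("INTELLIGENCE_AGENT_SERVICE line not found")


# The example output line from this article.
sample = "INTELLIGENCE_AGENT_SERVICE - - b3870142-bc15-eb87-549c-XXXXXXXXXXXX 24484"
print(parse_leader_uuid(sample))
```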

Remediation

1. Enable Advanced Streaming

Applicability

NSX 4.2.3 only (NSX 4.2.4+ and 9.x have advanced streaming enabled by default; this step is a no-op on those versions.)

Apply on the Intelligence Agent leader node

 
PATCH https://<intelligence-agent-leader-nsx-ip>/policy/api/v1/system-config
Content-Type: application/json

{
  "keyValuePairs": [
    { "key": "paceagent.advance.fullstreaming.enabled", "value": "true" },
    { "key": "paceagent.advance.deltastreaming.enabled", "value": "true" }
  ]
}

This setting is persistent and does not need to be reverted.
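
For readers who script the call, the request above could be assembled roughly as follows. This is a sketch under stated assumptions, not a supported client: it only builds the request with Python's standard library and does not send it, the helper name build_system_config_patch and the IP address 192.0.2.10 are placeholders, and authentication and TLS certificate handling are deliberately left out because this article does not specify them.

```python
import json
import urllib.request


def build_system_config_patch(leader_ip: str, settings: dict) -> urllib.request.Request:
    """Build (but do not send) a PATCH request for the system-config endpoint."""
    body = {
        "keyValuePairs": [
            {"key": key, "value": value} for key, value in settings.items()
        ]
    }
    return urllib.request.Request(
        url=f"https://{leader_ip}/policy/api/v1/system-config",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


request = build_system_config_patch(
    "192.0.2.10",  # placeholder for the Intelligence Agent leader IP
    {
        "paceagent.advance.fullstreaming.enabled": "true",
        "paceagent.advance.deltastreaming.enabled": "true",
    },
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request would additionally require credentials for the NSX Manager and a decision about certificate verification, both of which depend on the environment.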

2a. Relax Kafka Producer Acknowledgement Requirements

Applicability

NSX 4.2.3+, 9.0.2, 9.1+
(For earlier versions, see 2b below)


By default, the Kafka producer in the Pace Agent communication service is configured with acks=all, requiring acknowledgement from all in-sync replicas (ISR) before a produce request is considered committed. Under high-throughput replication, this introduces per-message round-trip latency that accumulates over the duration of a full sync and contributes to Corfu session timeout.
Setting relax.kafka.producer.acks=true downgrades the producer acknowledgement to acks=1 (leader-only ack), substantially reducing per-message write latency and overall sync duration.

Warning: This is a temporary mitigation. Relaxing producer acks reduces durability guarantees. If the Kafka leader fails mid-sync, acknowledged messages may be lost, requiring a re-sync. Revert this setting immediately once the Pace Agent returns to a healthy state and config sync is confirmed complete.

Apply on the Intelligence Agent leader node

 
PATCH https://<intelligence-agent-leader-ip>/policy/api/v1/system-config
Content-Type: application/json

{
  "keyValuePairs": [
    { "key": "paceagent.communicationservice.relax.kafka.producer.acks", "value": "true" }
  ]
}

Revert after sync completes

 
PATCH https://<intelligence-agent-leader-ip>/policy/api/v1/system-config
Content-Type: application/json

{
  "keyValuePairs": [
    { "key": "paceagent.communicationservice.relax.kafka.producer.acks", "value": "false" }
  ]
}
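
Because the relaxed setting must not be left in place, an automation script may want to pair the apply and revert calls so that the revert cannot be forgotten. One possible pattern, sketched in Python, is a context manager that applies the setting on entry and reverts it on exit. The names acks_payload, relaxed_producer_acks, and send_patch are hypothetical; send_patch stands for whatever function actually issues the PATCH request in your environment. Note that this sketch also reverts if the wait fails with an error, which is a design choice rather than something this article prescribes.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

ACKS_KEY = "paceagent.communicationservice.relax.kafka.producer.acks"


def acks_payload(relaxed: bool) -> dict:
    """Request body that turns relaxed producer acks on or off."""
    value = "true" if relaxed else "false"
    return {"keyValuePairs": [{"key": ACKS_KEY, "value": value}]}


@contextmanager
def relaxed_producer_acks(send_patch: Callable[[dict], None]) -> Iterator[None]:
    """Apply the relaxed setting, then always revert it on exit."""
    send_patch(acks_payload(True))
    try:
        yield
    finally:
        send_patch(acks_payload(False))


# Demonstration with a stand-in sender that only records the payloads.
sent = []
with relaxed_producer_acks(sent.append):
    pass  # wait here until the Pace Agent is healthy and sync is complete
print([p["keyValuePairs"][0]["value"] for p in sent])  # ['true', 'false']
```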


2b. Relax Kafka Topic Replication Constraints (For earlier NSX Versions)

Applicability

NSX 4.2.0, 4.2.1, 4.2.2, 9.0.0, 9.0.1

For versions that do not support the producer acks config via the system-config API, equivalent durability relaxation can be achieved by reducing min.insync.replicas on the relevant Kafka topics.
Please contact Broadcom Technical Support to assist you in safely executing this mitigation step.

Note: Step 2b involves Kafka CLI commands, which should be executed only under the supervision of the support team.

Recovery Verification

After applying the applicable mitigations, monitor Pace Agent health and config sync progress in the SSP UI.

  1. Sign in to SSP.
  2. Go to System → NSX Managers.
  3. Open the NSX Manager instance that was unhealthy or out of sync.
  4. Confirm Readiness is Ready.

Note: Readiness can take 30 minutes to 2 hours to turn Ready, depending on factors such as NSX configuration size and cluster load.
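
Given the long recovery window, a script that waits for recovery needs a generous timeout. The following Python sketch shows a generic polling loop with a two-hour default, matching the upper end of the range in the note above. It is only an illustration: this article documents the check through the SSP UI, so is_ready is a hypothetical caller-supplied function, and no SSP API for reading Readiness is implied here. The 60-second polling interval is likewise an arbitrary choice.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 2 * 60 * 60,  # upper end of the range in the note
    interval_s: float = 60.0,        # arbitrary polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Trivial demonstration: a check that succeeds immediately.
print(wait_until_ready(lambda: True))
```

The clock and sleep parameters exist only so the loop can be exercised without real waiting; in normal use the defaults apply.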