Controller Connectivity Down on newly created Edge and Configuration State Node Not Ready

Article ID: 385902


Updated On:

Products

VMware NSX

Issue/Introduction

Edge nodes newly deployed via the NSX API are stuck in the Configuration State "Node Not Ready" (node_not_ready).

The Edge VMs deploy successfully and can connect to the NSX Manager on port 1234, but not to the CCP on port 1235. In the NSX UI, the Configuration State shows "Node Not Ready", Manager Connectivity shows "Up", and Controller Connectivity shows "Down".

On the Edge node, the manager service connects successfully, whereas the CCP session reports "OTHER_ERROR".

root@NSX_EDGE:~# su admin -c "get managers"
Mon Dec 02 2024 UTC 08:06:44.023
- 10.##.##.01 Connected (NSX-RPC)
- 10.##.##.02 Connected (NSX-RPC) *
- 10.##.##.03 Connected (NSX-RPC)

root@NSX_EDGE:~# su admin -c "get controllers"
Mon Dec 02 2024 UTC 08:06:56.292
 Controller IP Port SSL Status Is Physical Master Session State Controller FQDN Failure Reason
 10.##.##.01 1235 enabled not used false null NA NA
 10.##.##.02 1235 enabled disconnected true down NA OTHER_ERROR <=========== CCP is not UP
 10.##.##.03 1235 enabled not used false null NA NA


The GET API for the transport node state shows the node deployment state as "NODE_NOT_READY" and the failure message "Waiting for edge node to be ready."

GET https://10.##.##.01/api/v1/transport-nodes/########-4107-####-bb04-############/state
{
    "transport_node_id": "########-4107-####-bb04-############",
    "maintenance_mode_state": "DISABLED",
    "node_deployment_state": {
        "state": "NODE_NOT_READY",
        "failure_message": "",
        "failure_code": -1
    },
    "hardware_version": "vmx-##",
    "state": "pending",
    "details": [
        {
            "sub_system_id": "########-4107-####-bb04-############",
            "sub_system_type": "Host",
            "state": "pending",
            "failure_message": "Waiting for edge node to be ready."
        }
    ]
}
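
The same state can also be queried from the command line with curl. This is an example only; substitute your NSX Manager IP, admin credentials, and transport node UUID:

curl -k -u admin -X GET "https://<nsx-manager-ip>/api/v1/transport-nodes/<transport-node-uuid>/state"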


Logs:

To check for a high count of EdgeConfigUpdateMsg messages sent by the Edge nodes to the Management Plane (MP), review:

/var/log/proton/nsxapi.log

Per-minute count of EdgeConfigUpdateMsg messages sent from all edge nodes:

/var/log/proton$ grep "Receive EdgeConfigUpdateMsg" nsxapi* | grep "2024-12-01T09:21" | wc -l
173

This shows that approximately 173 EdgeConfigUpdateMsg messages are sent per minute by all edge nodes combined.
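
For a broader view, the same log lines can be grouped per minute in a single pass. This is a sketch only; adjust the date prefix in the grep pattern to the window being investigated:

/var/log/proton$ grep "Receive EdgeConfigUpdateMsg" nsxapi* | grep -o "2024-12-01T[0-9][0-9]:[0-9][0-9]" | sort | uniq -c | sort -rn

The output lists a message count for each minute, making sustained spikes easier to spot.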


Per-minute EdgeConfigUpdateMsg count for a single edge node:

/var/log/proton$ grep "EdgeConfigUpdateMsg for fabric edge node" nsxapi* | grep "2024-12-01T09:21" | grep ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:15.352Z  INFO EdgeTNRpcRequestRouter2 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:35.972Z  INFO EdgeTNRpcRequestRouter5 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:59.148Z  INFO EdgeTNRpcRequestRouter4 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############

This shows that approximately 3 EdgeConfigUpdateMsg messages are sent per minute by a single edge node, even though there are no configuration changes on that edge node.
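
To identify which edge nodes generate the most messages in a given minute, the same log lines can be grouped by edge node UUID. This is a sketch only; adjust the timestamp filter as needed:

/var/log/proton$ grep "EdgeConfigUpdateMsg for fabric edge node" nsxapi* | grep "2024-12-01T09:21" | grep -o "fabric edge node: .*" | sort | uniq -c | sort -rn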

On a large-scale setup, the Management Plane is overloaded with EdgeConfigUpdateMsg messages from the Edge nodes (sent every 5 seconds) and cannot acknowledge them in time. Because of this, incoming AppInitMsg messages from new edge nodes are not answered by the manager.

Cause

The Edge sends a high count of EdgeConfigUpdateMsg messages at a short interval of 5 seconds. If no reply is received from the MP, the Edge re-sends the EdgeConfigUpdateMsg.

The EdgeConfigUpdateMsg count on the manager nodes becomes high, leading to manager overload. AppInitMsg messages from new Edge nodes are not answered, leaving the Edge nodes stuck in the "Node Not Ready" state.

Resolution

This issue is resolved in VMware NSX 4.2.2, available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

  • Reboot all 3 manager nodes so that the message queue is emptied and AppInitMsg messages from new Edge nodes receive a reply. Afterwards, verify connectivity as shown below.
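
After the manager nodes are back up, connectivity can be re-checked from the affected Edge node, for example:

root@NSX_EDGE:~# su admin -c "get controllers"

The controller entry marked as master should now show a connected status instead of "disconnected" with OTHER_ERROR, and the Configuration State in the NSX UI should progress past "Node Not Ready".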