Newly deployed Edge via NSX API fails to connect to CCP due to Edge Configuration State "Node Not ready".

Article ID: 385902

Updated On:

Products

VMware NSX

Issue/Introduction

Edge nodes newly deployed via the NSX API are stuck in the Configuration State 'node_not_ready'.

The Edge VMs are deployed successfully and can connect to the NSX Manager on port 1234, but not to the Central Control Plane (CCP) on port 1235. In the NSX UI, the Configuration State is shown as "Node Not Ready", Manager Connectivity as "Up", and Controller Connectivity as "Down".

The manager connection is established successfully, whereas the CCP connection reports the failure reason "OTHER_ERROR".

root@NSX_EDGE:~# su admin -c "get managers"
Mon Dec 02 2024 UTC 08:06:44.023
- 10.##.##.01 Connected (NSX-RPC)
- 10.##.##.02 Connected (NSX-RPC) *
- 10.##.##.03 Connected (NSX-RPC)

root@NSX_EDGE:~# su admin -c "get controllers"
Mon Dec 02 2024 UTC 08:06:56.292
 Controller IP   Port   SSL       Status         Is Physical Master   Session State   Controller FQDN   Failure Reason
 10.##.##.01     1235   enabled   not used       false                null            NA                NA
 10.##.##.02     1235   enabled   disconnected   true                 down            NA                OTHER_ERROR   <=========== CCP is not UP
 10.##.##.03     1235   enabled   not used       false                null            NA                NA


The GET API for the transport node state returns the node deployment state "NODE_NOT_READY" and, in the details, the failure message "Waiting for edge node to be ready."

GET https://10.##.##.01/api/v1/transport-nodes/########-4107-####-bb04-############/state
{
    "transport_node_id": "########-4107-####-bb04-############",
    "maintenance_mode_state": "DISABLED",
    "node_deployment_state": {
        "state": "NODE_NOT_READY",
        "failure_message": "",
        "failure_code": -1
    },
    "hardware_version": "vmx-##",
    "state": "pending",
    "details": [
        {
            "sub_system_id": "########-4107-####-bb04-############",
            "sub_system_type": "Host",
            "state": "pending",
            "failure_message": "Waiting for edge node to be ready."
        }
    ]
}
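
A scripted check for this symptom might look like the following minimal sketch. It assumes the JSON body returned by the state call above has already been retrieved and parsed; the function name is_stuck_not_ready is illustrative and is not part of any NSX SDK.

```python
import json


def is_stuck_not_ready(state: dict) -> bool:
    """Return True when a state response matches the symptom in this article."""
    deployment = state.get("node_deployment_state", {})
    if deployment.get("state") != "NODE_NOT_READY":
        return False
    # The example response carries the waiting message in the "details" list.
    messages = [d.get("failure_message", "") for d in state.get("details", [])]
    return any("Waiting for edge node to be ready" in m for m in messages)


# Sample body copied from the response above (identifiers remain masked).
sample = json.loads("""
{
    "transport_node_id": "########-4107-####-bb04-############",
    "maintenance_mode_state": "DISABLED",
    "node_deployment_state": {
        "state": "NODE_NOT_READY",
        "failure_message": "",
        "failure_code": -1
    },
    "hardware_version": "vmx-##",
    "state": "pending",
    "details": [
        {
            "sub_system_id": "########-4107-####-bb04-############",
            "sub_system_type": "Host",
            "state": "pending",
            "failure_message": "Waiting for edge node to be ready."
        }
    ]
}
""")

print(is_stuck_not_ready(sample))
```

The check requires both the deployment state and the waiting message so that other pending conditions are not reported as this issue.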


Logs :

To check for a high count of EdgeConfigUpdateMsg messages sent by the Edge nodes to the Management Plane (MP), review the following log on the NSX Manager:

/var/log/proton/nsxapi.log

Per-minute count of EdgeConfigUpdateMsg messages sent by all Edge nodes:

/var/log/proton$ grep "Receive EdgeConfigUpdateMsg" nsxapi* | grep "2024-12-01T09:21" | wc -l
173

This shows that all Edge nodes together sent approximately 173 EdgeConfigUpdateMsg messages in that minute.


Per-minute count of EdgeConfigUpdateMsg messages for a single Edge node:

/var/log/proton$ grep "EdgeConfigUpdateMsg for fabric edge node" nsxapi* | grep "2024-12-01T09:21" | grep ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:15.352Z  INFO EdgeTNRpcRequestRouter2 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:35.972Z  INFO EdgeTNRpcRequestRouter5 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:59.148Z  INFO EdgeTNRpcRequestRouter4 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############

This shows that a single Edge node sent approximately 3 EdgeConfigUpdateMsg messages in that minute, even though there were no configuration changes on the Edge node.
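
The same counts can be produced for every minute and every Edge node in one pass. The following is a minimal sketch, assuming log lines in the format of the excerpts above; the regular expression and the function name count_config_updates are illustrative. It matches the "EdgeConfigUpdateMsg for fabric edge node" lines, so its totals can differ from the "Receive EdgeConfigUpdateMsg" grep shown earlier.

```python
import re
from collections import Counter

# Matches lines like the nsxapi.log excerpts above. An "nsxapi.N.log:" prefix
# added by grep is tolerated because the pattern is searched, not anchored.
LINE_RE = re.compile(
    r"(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+Z.*"
    r"EdgeConfigUpdateMsg for fabric edge node: (?P<node>\S+)"
)


def count_config_updates(lines):
    """Return two Counters: messages per minute, and per (node, minute)."""
    per_minute = Counter()
    per_node_minute = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an EdgeConfigUpdateMsg line
        per_minute[match["minute"]] += 1
        per_node_minute[(match["node"], match["minute"])] += 1
    return per_minute, per_node_minute


# The three log lines quoted above, shortened to the fields the pattern needs.
sample = [
    "nsxapi.2.log:2024-12-01T09:21:15.352Z  INFO EdgeTNRpcRequestRouter2 "
    "Received message: EdgeConfigUpdateMsg for fabric edge node: "
    "########-b435-####-a299-############",
    "nsxapi.2.log:2024-12-01T09:21:35.972Z  INFO EdgeTNRpcRequestRouter5 "
    "Received message: EdgeConfigUpdateMsg for fabric edge node: "
    "########-b435-####-a299-############",
    "nsxapi.2.log:2024-12-01T09:21:59.148Z  INFO EdgeTNRpcRequestRouter4 "
    "Received message: EdgeConfigUpdateMsg for fabric edge node: "
    "########-b435-####-a299-############",
]

per_minute, per_node_minute = count_config_updates(sample)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
```

To run it against a real log, pass an open file object (for example, one opened on /var/log/proton/nsxapi.log) instead of the sample list.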

In a scale setup, the MP is overloaded by the volume of EdgeConfigUpdateMsgs that the Edge nodes send (every 5 seconds) and cannot acknowledge them in time. As a result, the manager does not reply to incoming AppInitMsgs from new Edge nodes.

Environment

VMware NSX

Cause

Each Edge node sends EdgeConfigUpdateMsgs at a short interval of 5 seconds. If no reply is received from the MP, the Edge node re-sends the EdgeConfigUpdateMsgs.

In a scale setup, the EdgeConfigUpdateMsg count on the manager nodes becomes high and the managers are overloaded. As a result, the managers do not reply to incoming AppInitMsgs from new Edge nodes, which leaves those Edge nodes stuck in the node_not_ready state.
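
The overload mechanism can be illustrated with a toy model. The sketch below is not NSX code: it assumes a single first-in, first-out queue on the manager, a fixed number of messages processed per interval, and one message from every Edge node per interval. The function names and all numbers are invented for illustration only.

```python
def backlog_over_time(edge_count, processed_per_tick, ticks):
    """Queue depth after each tick when every Edge sends one message per tick.

    Toy model only: one FIFO queue with a fixed service capacity per tick.
    """
    depth = 0
    history = []
    for _ in range(ticks):
        depth += edge_count                      # new messages arrive
        depth -= min(depth, processed_per_tick)  # manager drains what it can
        history.append(depth)
    return history


def ticks_until_served(depth_ahead, processed_per_tick):
    """Ticks a newly queued message waits behind depth_ahead older messages."""
    return depth_ahead // processed_per_tick + 1


# Below capacity the queue stays empty; above capacity it grows every tick.
print(backlog_over_time(edge_count=40, processed_per_tick=50, ticks=5))
print(backlog_over_time(edge_count=60, processed_per_tick=50, ticks=5))
```

In this model, a message from a new Edge node that arrives behind a growing backlog waits longer the later it arrives, which matches the described behavior of AppInitMsgs going unanswered.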

Resolution

Impacted Version: NSX 4.x

Fixed Version: This issue will be fixed in a future release of NSX.

Workaround: Reboot all three manager nodes so that the message queue is emptied and AppInitMsgs from new Edge nodes receive a reply.
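
After applying the workaround, the transport node state can be polled until the symptom clears. The following is a minimal sketch, not an official procedure. The function name wait_until_ready, the attempt count, and the delay are assumptions; the caller supplies fetch_state, a function that performs the state GET shown earlier (including authentication and TLS handling) and returns the parsed JSON body.

```python
import time


def wait_until_ready(fetch_state, attempts=30, delay_seconds=10.0, sleep=time.sleep):
    """Poll until node_deployment_state.state is no longer NODE_NOT_READY.

    fetch_state: zero-argument callable returning the parsed JSON body of
    GET /api/v1/transport-nodes/<transport-node-id>/state.
    Returns the last observed deployment state string (None if it is absent).
    """
    last = None
    for attempt in range(attempts):
        body = fetch_state()
        last = body.get("node_deployment_state", {}).get("state")
        if last != "NODE_NOT_READY":
            return last
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before the next poll
    return last
```

If the function still returns "NODE_NOT_READY" after all attempts, the Edge node has not recovered within the polling window.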