Controller Connectivity Down on newly created Edge and Configuration State Node Not Ready

Article ID: 385902


Updated On:

Products

VMware NSX

Issue/Introduction

Edge nodes newly deployed via the NSX API are stuck in the Configuration State "Node Not Ready" (node_not_ready).

The Edge VMs deploy successfully and can connect to the NSX Manager on port 1234, but not to the CCP on port 1235. In the NSX UI, the Configuration State shows "Node Not Ready", Manager Connectivity shows "Up", and Controller Connectivity shows "Down".

On the Edge node, the manager service connects successfully, whereas the CCP session reports "OTHER_ERROR".

root@NSX_EDGE:~# su admin -c "get managers"
Mon Dec 02 2024 UTC 08:06:44.023
- 10.##.##.01 Connected (NSX-RPC)
- 10.##.##.02 Connected (NSX-RPC) *
- 10.##.##.03 Connected (NSX-RPC)

root@NSX_EDGE:~# su admin -c "get controllers"
Mon Dec 02 2024 UTC 08:06:56.292
 Controller IP Port SSL Status Is Physical Master Session State Controller FQDN Failure Reason
 10.##.##.01 1235 enabled not used false null NA NA
 10.##.##.02 1235 enabled disconnected true down NA OTHER_ERROR <=========== CCP is not UP
 10.##.##.03 1235 enabled not used false null NA NA


The GET API for the transport node state shows the node deployment state as "NODE_NOT_READY" and the failure message "Waiting for edge node to be ready."

GET https://10.##.##.01/api/v1/transport-nodes/########-4107-####-bb04-############/state
{
    "transport_node_id": "########-4107-####-bb04-############",
    "maintenance_mode_state": "DISABLED",
    "node_deployment_state": {
        "state": "NODE_NOT_READY",
        "failure_message": "",
        "failure_code": -1
    },
    "hardware_version": "vmx-##",
    "state": "pending",
    "details": [
        {
            "sub_system_id": "########-4107-####-bb04-############",
            "sub_system_type": "Host",
            "state": "pending",
            "failure_message": "Waiting for edge node to be ready."
        }
    ]
}
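
The same state can also be queried from the command line with curl. This is an example only; substitute your NSX Manager IP, admin credentials, and transport node UUID:

curl -k -u admin -X GET "https://<nsx-manager-ip>/api/v1/transport-nodes/<transport-node-uuid>/state"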


Logs:

To check for a high count of EdgeConfigUpdateMsg messages sent by the Edge nodes to the Management Plane (MP), review:

/var/log/proton/nsxapi.log

Per-minute count of EdgeConfigUpdateMsg messages sent from all edge nodes:

/var/log/proton$ grep "Receive EdgeConfigUpdateMsg" nsxapi* | grep "2024-12-01T09:21" | wc -l
173

This shows that approximately 173 EdgeConfigUpdateMsg messages are sent per minute by all edge nodes combined.
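
For a broader view, the same log lines can be grouped per minute in a single pass. This is a sketch only; adjust the date prefix in the grep pattern to the window being investigated:

/var/log/proton$ grep "Receive EdgeConfigUpdateMsg" nsxapi* | grep -o "2024-12-01T[0-9][0-9]:[0-9][0-9]" | sort | uniq -c | sort -rn

The output lists a message count for each minute, making sustained spikes easier to spot.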


Per-minute EdgeConfigUpdateMsg count for a single edge node:

/var/log/proton$ grep "EdgeConfigUpdateMsg for fabric edge node" nsxapi* | grep "2024-12-01T09:21" | grep ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:15.352Z  INFO EdgeTNRpcRequestRouter2 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:35.972Z  INFO EdgeTNRpcRequestRouter5 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############
nsxapi.2.log:2024-12-01T09:21:59.148Z  INFO EdgeTNRpcRequestRouter4 EdgeTNConfigUpdateRequestHandler 77172 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message: EdgeConfigUpdateMsg for fabric edge node: ########-b435-####-a299-############

This shows that approximately 3 EdgeConfigUpdateMsg messages are sent per minute by a single edge node, even though there are no configuration changes on that edge node.
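
To identify which edge nodes generate the most messages in a given minute, the same log lines can be grouped by edge node UUID. This is a sketch only; adjust the timestamp filter as needed:

/var/log/proton$ grep "EdgeConfigUpdateMsg for fabric edge node" nsxapi* | grep "2024-12-01T09:21" | grep -o "fabric edge node: .*" | sort | uniq -c | sort -rn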

On a large-scale setup, the Management Plane is overloaded with EdgeConfigUpdateMsg messages from the Edge nodes (sent every 5 seconds) and cannot acknowledge them in time. Because of this, incoming AppInitMsg messages from new edge nodes are not answered by the manager.

Cause

The Edge sends a high count of EdgeConfigUpdateMsg messages at a short interval of 5 seconds. If no reply is received from the MP, the Edge re-sends the EdgeConfigUpdateMsg.

The EdgeConfigUpdateMsg count on the manager nodes becomes high, leading to manager overload. AppInitMsg messages from new Edge nodes are not answered, leaving the Edge nodes stuck in the "Node Not Ready" state.

Resolution

This issue is resolved in VMware NSX 4.2.2, available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

  • Reboot all 3 manager nodes so that the message queue is emptied and AppInitMsg messages from new Edge nodes receive a reply. Afterwards, verify connectivity as shown below.
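
After the manager nodes are back up, connectivity can be re-checked from the affected Edge node, for example:

root@NSX_EDGE:~# su admin -c "get controllers"

The controller entry marked as master should now show a connected status instead of "disconnected" with OTHER_ERROR, and the Configuration State in the NSX UI should progress past "Node Not Ready".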