Frequent pod restart observed after vMotion

Article ID: 373353

Updated On:

Products

VMware NSX

Issue/Introduction

The node agent logs error NCP01012 with the message "Agent is exiting as connection is unavailable" and exits, which causes pods to restart.

/var/log/nsx-ujo/nsx_node_agent.log:
66770 2024-07-16T06:35:39.731Z #######-####-####-####-############ NSX 872887 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] nsx_ujo.agent.agent Agent is unavailable for 30 seconds: connection inactive.hyperbus service inactive., retrying
66771 2024-07-16T06:35:44.738Z #######-####-####-####-############ NSX 872887 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01012"] nsx_ujo.agent.agent Agent is exiting as connection is unavailable
66772 2024-07-16T06:35:44.830Z #######-####-####-####-############ NSX 872858 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] oslo_privsep.comm Unexpected error: <class 'OSError'>
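
To check whether a node shows the same signature, the log can be scanned for entries that carry errorCode="NCP01012". The following is a minimal Python sketch, not a supported tool. The function name find_agent_exits and the timestamp pattern are illustrative assumptions based on the log lines shown above.

```python
import re

# Assumed timestamp shape, taken from the sample log lines in this article,
# for example 2024-07-16T06:35:44.738Z
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z")


def find_agent_exits(lines):
    """Return the timestamps of log lines that carry errorCode="NCP01012".

    `lines` is any iterable of strings, such as an open log file.
    A matching line without a recognisable timestamp yields None.
    """
    hits = []
    for line in lines:
        if 'errorCode="NCP01012"' in line:
            match = TIMESTAMP.search(line)
            hits.append(match.group(0) if match else None)
    return hits
```

Passing an open handle for /var/log/nsx-ujo/nsx_node_agent.log to find_agent_exits returns one timestamp per agent exit, which can then be compared against the times of vMotion events.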

Environment


NCP 4.1.X
NSX Version: 4.1.X
Impact: Pod restarts are observed across clients spread across clusters

Cause

On vMotion, the node agent closes its existing RPC connection with CfgAgent. The connection may be in a sleep state for up to 40 seconds, which can delay the close for that long. The NCP health check restarts the pod if the node agent remains disconnected for more than 30 seconds, so a delayed close can trigger a restart.
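
The timing can be illustrated with a short Python sketch. It uses the 30-second and 40-second figures from this article and makes the simplifying assumption that the node agent stays disconnected for as long as the connection close is delayed. The constant and function names are hypothetical and are not part of NCP.

```python
HEALTH_CHECK_LIMIT_S = 30  # from this article: restart after more than 30 s disconnected
MAX_CLOSE_DELAY_S = 40     # from this article: connection may sleep for up to 40 s


def restart_expected(close_delay_s, limit_s=HEALTH_CHECK_LIMIT_S):
    """Return True when the disconnect outlasts the health-check limit.

    Assumes the agent is disconnected for the full close delay.
    """
    return close_delay_s > limit_s


def restart_window_s(max_delay_s=MAX_CLOSE_DELAY_S, limit_s=HEALTH_CHECK_LIMIT_S):
    """Return how many seconds of possible delay exceed the limit."""
    return max(0, max_delay_s - limit_s)
```

Under these assumptions, any close delay between 30 and 40 seconds leads to a restart, while shorter delays do not.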

Resolution

No workaround is available.

This issue is fixed in NSX 4.2.1.