Antrea cluster is down with NSX alarm "Control Channel To Antrea Cluster Down Long".

Article ID: 401229

Updated On:

Products

VMware NSX

Issue/Introduction

NSX Manager reports the "Control Channel To Antrea Cluster Down" alarm.

The Distributed Firewall also reports an "unknown" status for firewall rules.

Firewall rule error reported as follows: 

Antrea nestdb-lite log:

[nsx@6876 comp="nsx-cluster-control-plane" level="ERROR" s2comp="nsx-rpc" subcomp="nestdb-lite"] EndpointResolver(nsx-nestdb) NsxRpcConnection[server:1584811423 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}]'s endpoint ####-#### is already registered.
[nsx@6876 comp="nsx-cluster-control-plane" level="INFO" s2comp="nsx-rpc" subcomp="nestdb-lite"] EndpointResolver(nsx-nestdb) Rejecting connection. Policy: &{REJECT_INCOMING_CONNECTION}, incoming connection: NsxRpcConnection[server:614164195 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}], existing connection: NsxRpcConnection[server:1584811423 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}]
[nsx@6876 comp="nsx-cluster-control-plane" level="WARNING" s2comp="nsx-rpc" subcomp="nestdb-lite"] ServiceResolver(nsx-nestdb) Service vmware.nsx.cli.MetricsCliService is already registered by another connection [NsxRpcConnection[server:1584811423 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}]]
 [nsx@6876 comp="nsx-cluster-control-plane" level="INFO" s2comp="nsx-rpc" subcomp="nestdb-lite"] NsxRpcConnection[server:614164195 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}] close("Rejecting incoming connection")
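To check a saved copy of the nestdb-lite log for this signature, a small helper along the following lines may be useful. This is a minimal sketch, not a supported tool: the function name count_rejections is hypothetical, and the log file path must be supplied by the caller because this article does not state where the log is stored.

```shell
#!/bin/sh
# Sketch only. count_rejections is a hypothetical helper name.
# Usage: count_rejections <log-file>
# Prints the number of nestdb-lite log lines that show an incoming
# connection being rejected, as in the excerpt above.
count_rejections() {
  grep -E -c 'subcomp="nestdb-lite".*(REJECT_INCOMING_CONNECTION|Rejecting incoming connection)' "$1"
}
```

A count that keeps growing while the alarm is open is consistent with the symptom described here; a count of 0 suggests a different cause.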

 

Environment

VMware vSphere Kubernetes Service (VKS) with Antrea Interworking installed. 

Cause

The TN proxy repeatedly fails to connect to the nestdb-lite server because a stale RpcConnection object remains registered on the server. Each time the TN proxy reconnects and creates a new stub, the Go NSX RPC library in the nestdb-lite server immediately rejects the connection: a stale RpcConnection object already exists for that client, so no new one is created.

Resolution

Stop and start Antrea Interworking:

Stop:

kubectl scale deployment interworking --replicas=0 -n vmware-system-antrea

kubectl get pods -o wide -n vmware-system-antrea # Confirm that the interworking pod is gone.

Start:

kubectl scale deployment interworking --replicas=1 -n vmware-system-antrea

kubectl get pods -o wide -n vmware-system-antrea # Confirm that the interworking pod is present and in the Running state.
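
The stop and start steps above can also be scripted. The following is a minimal sketch under stated assumptions: the namespace and deployment name are taken from the commands above, while the function names, the number of polls, the polling interval, and the 180s rollout timeout are illustrative choices, not values from this article. It relies on the standard Deployment pod naming (deployment name followed by generated suffixes) to detect whether the pod still exists.

```shell
#!/bin/sh
# Sketch only: scripted stop/start of Antrea Interworking.
# NAMESPACE and DEPLOYMENT come from the manual commands above;
# function names, poll counts, and timeouts are assumptions.

NAMESPACE="vmware-system-antrea"
DEPLOYMENT="interworking"

# Succeeds while at least one pod of the deployment still exists.
# Pods created by a Deployment are named <deployment>-<hash>-<suffix>.
interworking_pod_exists() {
  kubectl get pods -n "$NAMESPACE" -o name | grep -q "^pod/${DEPLOYMENT}-"
}

# Usage: restart_interworking [max_tries] [interval_seconds]
restart_interworking() {
  max_tries="${1:-60}"   # assumed default: polls before giving up
  interval="${2:-2}"     # assumed default: seconds between polls

  # Stop: scale to zero, then poll until the pod has disappeared.
  kubectl scale deployment "$DEPLOYMENT" --replicas=0 -n "$NAMESPACE" || return 1
  tries=0
  while interworking_pod_exists; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "interworking pod still present after $tries checks" >&2
      return 1
    fi
    sleep "$interval"
  done

  # Start: scale back to one replica and wait until it is available.
  kubectl scale deployment "$DEPLOYMENT" --replicas=1 -n "$NAMESPACE" || return 1
  kubectl rollout status deployment "$DEPLOYMENT" -n "$NAMESPACE" --timeout=180s
}
```

After sourcing the file, run restart_interworking with no arguments to use the assumed defaults. The function returns a non-zero status if the pod does not disappear in time or if any kubectl command fails, so verify the result with the kubectl get pods command shown above.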

Additional Information

This issue will be fixed in a future Antrea Interworking release.