NSX Manager reports the alarm "Controller Channel to Antrea Cluster Down".
The Distributed Firewall also reports an "unknown" status for firewall rules.
Firewall rule error reported as follows:
Antrea nestdb-lite log:
[nsx@6876 comp="nsx-cluster-control-plane" level="ERROR" s2comp="nsx-rpc" subcomp="nestdb-lite"] EndpointResolver(nsx-nestdb) NsxRpcConnection[server:1584811423 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}]'s endpoint ####-#### is already registered.
[nsx@6876 comp="nsx-cluster-control-plane" level="INFO" s2comp="nsx-rpc" subcomp="nestdb-lite"] EndpointResolver(nsx-nestdb) Rejecting connection. Policy: &{REJECT_INCOMING_CONNECTION}, incoming connection: NsxRpcConnection[server:614164195 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}], existing connection: NsxRpcConnection[server:1584811423 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}]
[nsx@6876 comp="nsx-cluster-control-plane" level="WARNING" s2comp="nsx-rpc" subcomp="nestdb-lite"] ServiceResolver(nsx-nestdb) Service vmware.nsx.cli.MetricsCliService is already registered by another connection [NsxRpcConnection[server:1584811423 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}]]
[nsx@6876 comp="nsx-cluster-control-plane" level="INFO" s2comp="nsx-rpc" subcomp="nestdb-lite"] NsxRpcConnection[server:614164195 Local{/var/run/vmware/nestdb/nestdb-server.sock} Remote{@}] close("Rejecting incoming connection")
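To check whether a saved log shows the same signature as the excerpt above, a small shell helper can count the two characteristic messages. This is only a sketch: the function name `has_stale_rpc_signature` is hypothetical, and it assumes the nestdb-lite log has already been saved to a local file whose path you supply.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any product tooling.
# Usage: has_stale_rpc_signature <path-to-saved-nestdb-lite-log>
has_stale_rpc_signature() {
  local log_file="$1"
  local rejected registered

  if [ ! -r "$log_file" ]; then
    echo "cannot read log file: $log_file" >&2
    return 2
  fi

  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  rejected=$(grep -c 'Rejecting incoming connection' "$log_file" || true)
  registered=$(grep -c 'is already registered' "$log_file" || true)

  echo "rejected=${rejected} already_registered=${registered}"

  # Succeed only if both messages from the excerpt are present.
  [ "$rejected" -gt 0 ] && [ "$registered" -gt 0 ]
}
```

A zero exit status means both messages were found, which matches the symptom described here but is not proof of the root cause on its own.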
VMware vSphere Kubernetes Service (VKS) with Antrea Interworking installed.
TN proxy repeatedly fails to connect to the nestdb-lite server because of a stale RpcConnection object. Each time TN proxy reconnects to the nestdb-lite server and creates a new stub, the Go NSX RPC library in the nestdb-lite server immediately rejects the connection: a stale RpcConnection object already exists for that client, so no new one is created.
Stop and then start Antrea interworking:
Stop:
kubectl scale deployment interworking --replicas=0 -n vmware-system-antrea
kubectl get pods -o wide -n vmware-system-antrea # Check if the interworking pod disappears.
Start:
kubectl scale deployment interworking --replicas=1 -n vmware-system-antrea
kubectl get pods -o wide -n vmware-system-antrea # Check if the interworking pod appears and the state is running.
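The stop and start steps above can also be wrapped in one function that waits between them. This is a minimal sketch, not an official procedure: the function name `restart_interworking`, the retry count, and the 2-second poll interval are assumptions. The deployment and namespace names are taken from the commands above.

```shell
#!/usr/bin/env bash
NAMESPACE="vmware-system-antrea"
DEPLOYMENT="interworking"

# Hypothetical wrapper. Optional argument: number of 2-second polls (default 30).
restart_interworking() {
  local tries="${1:-30}"
  local replicas

  # Stop: scale the deployment down to zero replicas.
  kubectl scale deployment "$DEPLOYMENT" --replicas=0 -n "$NAMESPACE" || return 1

  # Poll until the deployment status reports no replicas.
  # The field is omitted when it is zero, so an empty result also counts.
  while [ "$tries" -gt 0 ]; do
    replicas=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
      -o jsonpath='{.status.replicas}') || return 1
    if [ -z "$replicas" ] || [ "$replicas" = "0" ]; then
      break
    fi
    tries=$((tries - 1))
    sleep 2
  done
  if [ "$tries" -le 0 ]; then
    echo "interworking pod did not stop in time" >&2
    return 1
  fi

  # Start: scale back up and wait for the rollout to finish.
  kubectl scale deployment "$DEPLOYMENT" --replicas=1 -n "$NAMESPACE" || return 1
  kubectl rollout status deployment "$DEPLOYMENT" -n "$NAMESPACE" --timeout=120s || return 1

  # Final manual check, as in the steps above.
  kubectl get pods -o wide -n "$NAMESPACE"
}
```

Run `restart_interworking` after sourcing the file. The final `kubectl get pods` output should still be checked by eye to confirm that the interworking pod is in the Running state.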
The issue will be fixed in a later Antrea Interworking release.