Antrea container clusters are down following the change of the NSX Principal Identity Certificate

Article ID: 301569

Products

VMware NSX, VMware Container Networking with Antrea

Issue/Introduction

  • After the NSX principal identity (PI) certificate used by Antrea container clusters is renewed, the status of the Antrea container clusters is shown as down.
  • The Antrea interworking pod is up, but its containers log the messages below.

    mp-adapter:

    I0725 07:28:51.594951     15 inventory_controller.go:441] Adding Service key igcb-quantum-inf-cbsportal/igcb-rbi-cbpdoapi to inventory object queue
    I0725 07:28:51.666972     15 controller.go:152] tryGetNSXRPCStubs called
    I0725 07:28:51.666999     15 rpc_client_impl.go:42] NsxRpcClientImpl: Cannot find master aph uuid.
    I0725 07:28:51.667005     15 controller.go:215] Failed to get NSX-RPC stubs, retrying: Can not find MasterAph uuid

    tn-proxy:

    2023-07-25T07:30:24.919Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="142" level="WARNING"] RpcConnection[213 Connecting to unix:///var/run/vmware/nsx-agent/threat-detector.sock 0] Couldn't connect to unix:///var/run/vmware/nsx-agent/threat-detector.sock (error: 2-No such file or directory)
    2023-07-25T07:30:24.919Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="142" level="WARNING"] RpcTransport[0] Unable to connect to unix:///var/run/vmware/nsx-agent/threat-detector.sock: 2-No such file or directory
    2023-07-25T07:30:24.919Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="142" level="INFO"] ConnectionKeeper[2 unix:///var/run/vmware/nsx-agent/threat-detector.sock] scheduling connection attempt in 5000 ms
    2023-07-25T07:30:25.965Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="142" level="INFO"] ConnectionKeeper[5 ssl://#########:1234] attempting connection from timer callback
    2023-07-25T07:30:25.966Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="142" level="INFO"] ConnectionKeeper[4 ssl://#########:1234] attempting connection from timer callback
    2023-07-25T07:30:25.966Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="142" level="INFO"] ConnectionKeeper[6 ssl://#########:1234] attempting connection from timer callback
    2023-07-25T07:30:25.996Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-net" tid="142" level="WARNING"] StreamConnection[216 Connecting to ssl://##########:1234 sid:216] Couldn't connect to 'ssl://172.24.87.9:1234' (error: 335544539-short read)
    2023-07-25T07:30:25.996Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-net" tid="142" level="WARNING"] StreamConnection[216 Error to ssl://##########:1234 sid:-1] Error 335544539-short read
    2023-07-25T07:30:25.996Z [nsx@6876 comp="nsx-bms" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="142" level="WARNING"] RpcConnection[216 Connecting to ssl://##########:1234 0] Couldn't connect to ssl://##########:1234 (error: 335544539-short read)
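To check collected logs for the same symptoms, a minimal Python sketch such as the following can help. It assumes you have already saved the container logs as text. The match patterns are taken only from the excerpts above, and the function name `find_symptoms` is hypothetical, not part of any NSX or Antrea tool.

```python
import re

# Patterns derived from the log excerpts in this article. The keys are the
# container names under which each excerpt appears.
SIGNATURES = {
    "mp-adapter": re.compile(
        r"Cannot find master aph uuid|Can not find MasterAph uuid"
    ),
    "tn-proxy": re.compile(
        r"Couldn't connect to '?ssl://\S+ \(error: \d+-short read\)"
    ),
}


def find_symptoms(log_text):
    """Return {container: [matching lines]} for lines that match a signature."""
    hits = {name: [] for name in SIGNATURES}
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits
```

Matches for both containers suggest, but do not prove, that the NSX-RPC channel is failing as described in this article. The `threat-detector.sock` warnings are deliberately not matched, because the tn-proxy pattern only looks at `ssl://` connection failures.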

Cause

Updating the PI certificate alone is not enough, because two channels use this certificate: the REST API and NSX-RPC. Updating the PI certificate from the NSX API/UI updates only the REST API certificate, so NSX-RPC communication breaks. Antrea security policy, group membership query, monitoring, traceflow, and support bundle all operate over the NSX-RPC channel, so these features stop working if only the PI certificate is replaced.

Resolution

This issue is resolved in VMware NSX 4.2.0.

Workaround:

Where possible, use a long expiration time for the Antrea-NSX certificate. If the certificate is expiring and you have already updated the PI certificate, delete the cluster registration and register the cluster again using the new certificate, as follows.

  1. Delete the cluster registration.
    See "Deregister an Antrea Kubernetes Cluster from NSX" for detailed instructions.

  2. Register again using the new certificate.
    Follow the steps in the "What to do next" section of the above document, which describes how to register the cluster with NSX again.
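
Separately from the re-registration steps, the first option in the workaround depends on knowing how much validity the certificate has left. The sketch below is one way to compute that in Python. It assumes you already have the certificate's notAfter value as text, for example the part after `notAfter=` printed by `openssl x509 -noout -enddate`. The function name `days_until_expiry` is hypothetical.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter time; negative if already expired.

    not_after must use the "Mon DD HH:MM:SS YYYY GMT" form that
    ssl.cert_time_to_seconds() accepts.
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400.0
```

For example, `days_until_expiry("Jan  5 09:34:43 2030 GMT")` returns the number of days from the current time to that date. How much remaining validity counts as too little is a local policy decision and is not defined by this article.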

Note: After the cluster is deregistered, references to it are removed from all security policies. After you register it again, edit the applied-to setting of those security policies (the container cluster scope in the NSX API) and add the re-registered cluster back.

Note: Before deleting the container cluster registration, record which Antrea policies refer to this container cluster. After re-registering the cluster, you can add it back to those Antrea policies.
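
The record-then-restore bookkeeping described in these notes can be sketched in Python as follows. This is an illustration only. It makes no NSX API calls, and the policy structure (dictionaries with `name` and `applied_to` keys) is a hypothetical stand-in, not the NSX policy schema. The function names are likewise hypothetical.

```python
def policies_referencing(policies, cluster_id):
    """Names of policies whose applied-to list contains cluster_id."""
    return [p["name"] for p in policies if cluster_id in p["applied_to"]]


def restore_cluster(policies, recorded_names, cluster_id):
    """Add cluster_id to the applied-to list of every recorded policy."""
    wanted = set(recorded_names)
    for p in policies:
        if p["name"] in wanted and cluster_id not in p["applied_to"]:
            p["applied_to"].append(cluster_id)
    return policies
```

Call `policies_referencing` before deregistering and keep its result. After re-registration, pass that result to `restore_cluster` together with whatever identifier the re-registered cluster has.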