NSX-T manager alarms showing TN Flow Exp Disconnected

Article ID: 311854

Products

VMware NSX
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Symptoms:
-- Alarms similar to the following appear in the NSX Manager UI under Home > Alarms:

The flow exporter on Transport node 29d0 is disconnected from the NSX Application Platform cluster's messaging broker. Data collection is affected.



-- Messages similar to the following appear in /var/log/nsx-syslog.log on the ESXi host:

2024-08-27T19:11:09.208Z Er(179) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111426" level="ERROR" errorCode="MPA11014"] nsxintel:Kafka message delivery failed, error: Local: Message timed out

2024-08-27T18:06:57Z Wa(180) nsx-sha: NSX 2112111 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="WARNING" s2comp="tsdb-sender-napp"] Failed to send one msg timestamp: 1724781722
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: entity: SEGMENT_PORT
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: entity_id: "6262675d-4474-463e-b43c-df236ca32fb4"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: node_id: "a5e4fb86-5017-4eed-a057-ed935529d920"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: nsx_site_id: "6db0ec8b-76bd-4315-896a-e55864b2a366"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: dfw_lsp {
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:   obj_id: "lsp_stats"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:   number_of_sessions: 224332612
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:   number_of_bytes: 19531189017264
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: }
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:  from plugin 7eaa66b3-60c0-4f23-884f-b5309b8ab2cd:
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:  <_InactiveRpcError of RPC that terminated with:
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:        status = StatusCode.UNAUTHENTICATED
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:        details = ""
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:        debug_error_string = "{"created":"@1724782017.581426466","description":"Error received from peer ipv4:172.24.27.104:443","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"","grpc_status":16}"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: >
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:  Traceback (most recent call last):
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:   File "/usr/lib/vmware/netopa/lib/python/sha/core/channel/provider/tsdb_provider.py", line 671, in send_metrics
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:     response = self._metric_stub.MetricsUpdate(msg, timeout=transmit_timeout,
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:   File "/usr/lib/vmware/netopa/lib/python/grpc/_channel.py", line 946, in __call__
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:     return _end_unary_response_blocking(state, call, False, None)
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:   File "/usr/lib/vmware/netopa/lib/python/grpc/_channel.py", line 849, in _end_unary_response_blocking
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:     raise _InactiveRpcError(state)
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:        status = StatusCode.UNAUTHENTICATED
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:        details = ""
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha:        debug_error_string = "{"created":"@1724782017.581426466","description":"Error received from peer ipv4:172.24.27.104:443","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"","grpc_status":16}"

-- Enable verbose logging for the exporter on the ESXi host. Once the logs are collected, revert the log level by running the same command with info instead of verbose.

/opt/vmware/nsx-cli/bin/nsx-appctl -t /var/run/vmware/exporter/common-exporter-cli set/loglevel verbose

-- With verbose logging enabled, messages similar to the following appear in /var/log/nsx-syslog.log:

2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111456" level="DEBUG"] rdkafka: CONNECT: [thrd:main]: ssl://x.x.x.x:9092/0: Selected for cluster connection: refresh unavailable topics (broker has 283142 connection attempt(s))
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111456" level="DEBUG"] rdkafka: CONNECT: [thrd:main]: Not selecting any broker for cluster connection: still suppressed for 49ms: no cluster connection
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: CONNECT: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Received CONNECT op
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: STATE: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Broker changed state INIT -> TRY_CONNECT
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: CONNECT: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: broker in state TRY_CONNECT connecting
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: STATE: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Broker changed state TRY_CONNECT -> CONNECT
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: CONNECT: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Connecting to ipv4#x.x.x.x:9092 (ssl) with socket 70
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: CONNECT: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Connected to ipv4#x.x.x.x:9092
2024-08-27T18:53:14.044Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: FAIL: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: SSL handshake failed: s3_pkt.c:1498: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown: SSL alert number 46 (after 8ms in state CONNECT) (_SSL): identical to last error: error log suppressed
2024-08-27T18:53:14.044Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: STATE: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Broker changed state CONNECT -> DOWN
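The error signatures above can be counted in a copy of nsx-syslog.log to see which symptoms a host shows. The following Python sketch is an illustration only: its regular expressions are built from the sample lines in this article, the dictionary keys are hypothetical labels rather than NSX error names, and the message wording may differ on other NSX versions.

```python
import re

# Patterns taken from the sample nsx-syslog.log lines in this article.
# The keys are illustrative labels, not official NSX error identifiers.
SIGNATURES = {
    "kafka_delivery_timeout": re.compile(r'errorCode="MPA11014".*Message timed out'),
    "grpc_unauthenticated": re.compile(r"StatusCode\.UNAUTHENTICATED"),
    "ssl_certificate_unknown": re.compile(r"SSL handshake failed:.*certificate unknown"),
}


def scan_lines(lines):
    """Return how many lines match each symptom pattern."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

For example, passing an open file handle such as open("/var/log/nsx-syslog.log", errors="replace") to scan_lines() returns one count per pattern. The ssl_certificate_unknown pattern only matches once verbose logging has been enabled.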

-- The ESXi host also reports that none of the flows it sent have been acknowledged (Total Ack'ed is 0):

[root@esxi:~] nsxcli -c get intelligence flows stats ack

Tue Aug 27 2024 UTC 19:05:07.613
             NSX Intelligence Host Flows Acknowledgement Statistics
--------------------------------------------------------------------------------
     host uuid: a5e4fb86-5017-4eed-a057-ed935529d920
     host type: nsx-esx(1)

  Total Sent     Total Ack'ed      Last Sent      Last Ack'ed      Last Sent Time
    511247            0               77               0          2024-08-27 19:01:07
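To check this condition on many hosts, the command output can be parsed. The Python sketch below assumes the column layout shown above (Total Sent, Total Ack'ed, Last Sent, Last Ack'ed, Last Sent Time); the function names are hypothetical, and the layout may differ between NSX versions.

```python
def parse_ack_stats(output):
    """Parse the data row of 'get intelligence flows stats ack' output.

    Assumes a header line containing "Total Sent" and "Total Ack'ed",
    followed by one whitespace-separated data row.
    """
    lines = output.splitlines()
    for i, line in enumerate(lines):
        if "Total Sent" in line and "Total Ack'ed" in line:
            for row in lines[i + 1:]:
                fields = row.split()
                if fields:
                    return {
                        "total_sent": int(fields[0]),
                        "total_acked": int(fields[1]),
                        "last_sent": int(fields[2]),
                        "last_acked": int(fields[3]),
                        "last_sent_time": " ".join(fields[4:]),
                    }
    raise ValueError("acknowledgement table not found in output")


def flows_unacknowledged(stats):
    """True when the host has sent flows but none were acknowledged."""
    return stats["total_sent"] > 0 and stats["total_acked"] == 0
```

For the sample output above (511247 flows sent, 0 acknowledged), flows_unacknowledged() returns True.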

-- Enable SSL handshake debug logging on the Kafka StatefulSet in NSX Application Platform. Revert this change after the log bundle is collected.

-- SSH to the NSX Manager.

   a) Open the Kafka StatefulSet for editing:
      napp-k edit sts kafka
   b) Locate the existing JMX parameters:
      - name: KAFKA_JMX_OPTS
        value: '-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false  -Dcom.sun.management.jmxremote.ssl=false
          -Dcom.sun.management.jmxremote.rmi.port=30001 '

      Append one more parameter, -Djavax.net.debug=ssl:handshake, to the end of this value (inside the closing quote).
   c) Save and exit. The Kafka pods restart with SSL handshake debugging enabled.

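-- Messages similar to the following appear in the Kafka pod logs. Kafka rejects the client certificate presented by the host with CERTIFICATE_UNKNOWN because PKIX path validation fails: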

2024-09-10T14:58:07.51011165Z stderr F javax.net.ssl|FINE|34|data-plane-kafka-network-thread-0-ListenerName(EXTERNAL)-SSL-8|2024-09-10 14:58:07.509 GMT|CertificateMessage.java:372|Consuming client Certificate handshake message (

2024-09-10T14:58:07.510126826Z stderr F "Certificates": [
2024-09-10T14:58:07.510130112Z stderr F   "certificate" : {
2024-09-10T14:58:07.510132713Z stderr F     "version"            : "v3",
2024-09-10T14:58:07.510135661Z stderr F     "serial number"      : "00 88 F5 7F D5 24 80 CE 11",
2024-09-10T14:58:07.510141803Z stderr F     "signature algorithm": "SHA256withRSA",
2024-09-10T14:58:07.510146628Z stderr F     "issuer"             : "UID=a5e4fb86-5017-4eed-a057-ed935529d920, CN=VMware-NSX-Host, [email protected], O="VMware, Inc.", L=Palo Alto, ST=California, C=US",
2024-09-10T14:58:07.510148922Z stderr F     "not before"         : "2021-03-22 19:51:36.000 GMT",
2024-09-10T14:58:07.510151113Z stderr F     "not  after"         : "2031-03-20 19:51:36.000 GMT",
2024-09-10T14:58:07.510153609Z stderr F     "subject"            : "UID=a5e4fb86-5017-4eed-a057-ed935529d920, CN=VMware-NSX-Host, [email protected], O="VMware, Inc.", L=Palo Alto, ST=California, C=US",
2024-09-10T14:58:07.510156181Z stderr F     "subject public key" : "RSA",
2024-09-10T14:58:07.510158876Z stderr F     "extensions"         : [
2024-09-10T14:58:07.51016122Z stderr F       {
...
...
2024-09-10T14:58:07.510717546Z stderr F javax.net.ssl|SEVERE|34|data-plane-kafka-network-thread-0-ListenerName(EXTERNAL)-SSL-8|2024-09-10 14:58:07.510 GMT|TransportContext.java:323|Fatal (CERTIFICATE_UNKNOWN): PKIX path validation failed: java.security.cert.CertPathValidatorException: signature check failed (
2024-09-10T14:58:07.510728452Z stderr F "throwable" : {
2024-09-10T14:58:07.510732413Z stderr F   sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: signature check failed
2024-09-10T14:58:07.510735399Z stderr F         at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:386)
2024-09-10T14:58:07.510738075Z stderr F         at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:291)
2024-09-10T14:58:07.51074041Z stderr F          at sun.security.validator.Validator.validate(Validator.java:271)
2024-09-10T14:58:07.510742757Z stderr F         at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:315
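The subject UID of the rejected certificate (a5e4fb86-5017-4eed-a057-ed935529d920 in this example) is the same value as the host uuid that the ESXi host reports, so it identifies the affected transport node. The Python sketch below pulls those UIDs out of a saved copy of the Kafka pod log. It is a heuristic with hypothetical names: it attributes each Fatal (CERTIFICATE_UNKNOWN) line to the most recently seen certificate subject UID, which can be wrong when handshakes from several hosts interleave in the log.

```python
import re
from collections import Counter

# Patterns based on the javax.net.debug=ssl:handshake output shown above.
SUBJECT_UID = re.compile(r'"subject"\s*:\s*"UID=([0-9A-Fa-f-]{36})')
FATAL_CERT_UNKNOWN = re.compile(r"Fatal \(CERTIFICATE_UNKNOWN\)")


def rejected_host_uids(lines):
    """Count CERTIFICATE_UNKNOWN failures per most recently seen subject UID."""
    failures = Counter()
    last_uid = None
    for line in lines:
        match = SUBJECT_UID.search(line)
        if match:
            last_uid = match.group(1).lower()
        elif last_uid and FATAL_CERT_UNKNOWN.search(line):
            failures[last_uid] += 1
    return failures
```

The issuer line also carries a UID, but the pattern deliberately matches only the "subject" field.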

-- Check whether the host certificate seen in the Kafka log matches the host certificate currently installed on the ESXi host:

[root@ESXi:~] openssl x509 -noout -text -in /etc/vmware/nsx/host-cert.pem

[Output Truncated]

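        Validity
            Not Before: Aug 14 17:04:26 2024 GMT
            Not After : Nov 17 17:04:26 2026 GMT

-- In this example the certificates do not match. The certificate presented to Kafka is valid from 2021-03-22 to 2031-03-20, while the certificate installed on the ESXi host is valid from 2024-08-14 to 2026-11-17.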
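Comparing the validity dates from both outputs is a quick way to tell whether they describe the same certificate. The Python sketch below parses the "not before" and "not after" fields from the Kafka debug log and the Not Before and Not After lines from openssl x509 -text output. Matching dates are only a heuristic (comparing serial numbers or fingerprints is stricter); the function names are hypothetical, and the date formats are taken from the samples in this article.

```python
import re
from datetime import datetime

# Date formats as they appear in the samples in this article.
KAFKA_FORMAT = "%Y-%m-%d %H:%M:%S.%f GMT"  # e.g. "2021-03-22 19:51:36.000 GMT"
OPENSSL_FORMAT = "%b %d %H:%M:%S %Y GMT"   # e.g. "Aug 14 17:04:26 2024 GMT"


def _validity(text, before_pattern, after_pattern, date_format):
    """Extract (not_before, not_after) datetimes using the given patterns."""
    before = re.search(before_pattern, text)
    after = re.search(after_pattern, text)
    if before is None or after is None:
        raise ValueError("validity dates not found")
    return (
        datetime.strptime(before.group(1).strip(), date_format),
        datetime.strptime(after.group(1).strip(), date_format),
    )


def kafka_validity(text):
    """Validity period from the Kafka javax.net.debug certificate dump."""
    return _validity(
        text,
        r'"not\s+before"\s*:\s*"([^"]+)"',
        r'"not\s+after"\s*:\s*"([^"]+)"',
        KAFKA_FORMAT,
    )


def openssl_validity(text):
    """Validity period from 'openssl x509 -noout -text' output."""
    return _validity(
        text,
        r"Not Before\s*:\s*(.+?GMT)",
        r"Not After\s*:\s*(.+?GMT)",
        OPENSSL_FORMAT,
    )


def same_validity(kafka_text, openssl_text):
    """True when both outputs report the same validity period."""
    return kafka_validity(kafka_text) == openssl_validity(openssl_text)
```

With the samples in this article, same_validity() returns False, which matches the mismatch described above.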

 

 

Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center 4.x
NSX Application Platform 4.1.2
NSX Application Platform 4.2.0

 

Cause

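This is a known issue affecting NSX Application Platform 4.1.2. It occurs when nsx-exporter on the ESXi host does not pick up a recently regenerated host certificate. The host certificate that Kafka receives then no longer matches the host certificate on the ESXi host, Kafka rejects the TLS handshake with CERTIFICATE_UNKNOWN, and flow data is not delivered.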

Resolution

First, verify that the transport node reporting this alarm can reach the NSX Application Platform (NAPP) messaging broker.

-- Log in to NSX-T Manager and go to System > Configuration > NSX Application Platform. Note the domain names and port numbers of the Ingress URL and the Messaging URL. The Ingress URL uses port 443 and the Messaging URL uses port 9092.

-- Run the following commands on the transport node reporting the alarm to verify that both connections succeed. If either connection fails, allow the traffic through any firewall between the host and the NAPP URLs.

nc -zv <Ingress URL domain> 443

nc -zv <Messaging URL domain> 9092
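Where nc is not available, or to check several endpoints from a script, the same TCP test can be written with Python's standard socket module. Like nc -zv, this sketch only confirms that a TCP connection opens. It does not perform the TLS handshake, so it cannot detect the certificate mismatch described in this article. The function name is hypothetical.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

For example, call port_open() with the Ingress URL domain and port 443, then with the Messaging URL domain and port 9092, using the values noted from the NSX Application Platform page.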

-- To clear the alarms, restart the following services, in the order shown, on every affected ESXi host with a certificate mismatch:
1. /etc/init.d/nsx-exporter restart
2. /etc/init.d/nsx-opsagent restart
3. /etc/init.d/netopad restart
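When several hosts need the same restarts, the three commands above can be run in sequence from a script. The Python sketch below assumes a Python 3 interpreter is available where it runs. The run parameter exists only so that the sequence can be tested without touching real services, and stopping at the first non-zero exit code is a design choice of this sketch, not a documented requirement.

```python
import subprocess

# Restart order from this article.
SERVICES = ("nsx-exporter", "nsx-opsagent", "netopad")


def restart_services(services=SERVICES, run=subprocess.run):
    """Run '/etc/init.d/<service> restart' for each service in order.

    Returns a dict of service name -> exit code and stops at the first
    non-zero exit code.
    """
    results = {}
    for service in services:
        completed = run(["/etc/init.d/" + service, "restart"])
        results[service] = completed.returncode
        if completed.returncode != 0:
            break
    return results
```

Calling restart_services() with no arguments runs the three commands through subprocess.run.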