Symptoms:
-- An alarm similar to the following appears in the NSX Manager UI on the Home >> Alarms page:
The flow exporter on Transport node 29d0 is disconnected from the NSX Application Platform cluster's messaging broker. Data collection is affected.
-- Log entries similar to the following appear in /var/log/nsx-syslog.log on the ESXi host:
2024-08-27T19:11:09.208Z Er(179) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111426" level="ERROR" errorCode="MPA11014"] nsxintel:Kafka message delivery failed, error: Local: Message timed out
2024-08-27T18:06:57Z Wa(180) nsx-sha: NSX 2112111 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="WARNING" s2comp="tsdb-sender-napp"] Failed to send one msg timestamp: 1724781722
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: entity: SEGMENT_PORT
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: entity_id: "6262675d-4474-463e-b43c-df236ca32fb4"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: node_id: "a5e4fb86-5017-4eed-a057-ed935529d920"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: nsx_site_id: "6db0ec8b-76bd-4315-896a-e55864b2a366"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: dfw_lsp {
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: obj_id: "lsp_stats"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: number_of_sessions: 224332612
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: number_of_bytes: 19531189017264
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: }
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: from plugin 7eaa66b3-60c0-4f23-884f-b5309b8ab2cd:
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: <_InactiveRpcError of RPC that terminated with:
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: status = StatusCode.UNAUTHENTICATED
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: details = ""
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: debug_error_string = "{"created":"@1724782017.581426466","description":"Error received from peer ipv4:172.24.27.104:443","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"","grpc_status":16}"
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: >
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: Traceback (most recent call last):
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: File "/usr/lib/vmware/netopa/lib/python/sha/core/channel/provider/tsdb_provider.py", line 671, in send_metrics
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: response = self._metric_stub.MetricsUpdate(msg, timeout=transmit_timeout,
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: File "/usr/lib/vmware/netopa/lib/python/grpc/_channel.py", line 946, in __call__
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: return _end_unary_response_blocking(state, call, False, None)
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: File "/usr/lib/vmware/netopa/lib/python/grpc/_channel.py", line 849, in _end_unary_response_blocking
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: raise _InactiveRpcError(state)
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: status = StatusCode.UNAUTHENTICATED
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: details = ""
2024-08-27T18:06:57Z Wa(180)[+] nsx-sha: debug_error_string = "{"created":"@1724782017.581426466","description":"Error received from peer ipv4:172.24.27.104:443","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"","grpc_status":16}"
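To check quickly whether a host shows both signatures above, the log can be searched for the two fixed strings. The following is a minimal sketch, not a supported tool; the function name count_signature is hypothetical, and the log path is the one given above.

```shell
# count_signature LOGFILE PATTERN
# Prints how many lines of LOGFILE contain the fixed string PATTERN.
count_signature() {
    # grep exits non-zero when the count is 0; the count is still printed.
    grep -c -F -- "$2" "$1" || true
}

# Example usage on the ESXi host:
#   count_signature /var/log/nsx-syslog.log 'Kafka message delivery failed'
#   count_signature /var/log/nsx-syslog.log 'StatusCode.UNAUTHENTICATED'
```

A non-zero count for both strings is consistent with the symptoms described here, but is not proof of this issue on its own.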
-- Enable verbose logging on the host. (Revert the log level to info once the logs are collected.)
/opt/vmware/nsx-cli/bin/nsx-appctl -t /var/run/vmware/exporter/common-exporter-cli set/loglevel verbose
-- With verbose logging enabled, /var/log/nsx-syslog.log shows SSL handshake failures similar to the following:
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111456" level="DEBUG"] rdkafka: CONNECT: [thrd:main]: ssl://x.x.x.x:9092/0: Selected for cluster connection: refresh unavailable topics (broker has 283142 connection attempt(s))
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111456" level="DEBUG"] rdkafka: CONNECT: [thrd:main]: Not selecting any broker for cluster connection: still suppressed for 49ms: no cluster connection
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: CONNECT: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Received CONNECT op
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: STATE: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Broker changed state INIT -> TRY_CONNECT
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: CONNECT: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: broker in state TRY_CONNECT connecting
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: STATE: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Broker changed state TRY_CONNECT -> CONNECT
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: CONNECT: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Connecting to ipv4#x.x.x.x:9092 (ssl) with socket 70
2024-08-27T18:53:14.035Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: CONNECT: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Connected to ipv4#x.x.x.x:9092
2024-08-27T18:53:14.044Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: FAIL: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: SSL handshake failed: s3_pkt.c:1498: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown: SSL alert number 46 (after 8ms in state CONNECT) (_SSL): identical to last error: error log suppressed
2024-08-27T18:53:14.044Z Db(183) nsx-exporter[2111336]: NSX 2111336 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2111461" level="DEBUG"] rdkafka: STATE: [thrd:ssl://x.x.x.x:9092/0]: ssl://x.x.x.x:9092/0: Broker changed state CONNECT -> DOWN
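The verbose output is noisy, so it can help to reduce it to the broker state transitions. The sketch below assumes the "Broker changed state A -> B" wording shown in the excerpt above; the function name broker_transitions is hypothetical.

```shell
# broker_transitions LOGFILE
# Prints each rdkafka broker state transition found in LOGFILE, in log order.
broker_transitions() {
    sed -n 's/.*Broker changed state \([A-Z_]* -> [A-Z_]*\).*/\1/p' "$1"
}

# Example usage on the ESXi host:
#   broker_transitions /var/log/nsx-syslog.log | tail -n 3
```

For the excerpt above this yields INIT -> TRY_CONNECT, TRY_CONNECT -> CONNECT, and CONNECT -> DOWN, which is the pattern of a TCP connection that succeeds followed by a failed SSL handshake.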
-- The ESXi host also reports 0 flows acknowledged:
[root@esxi:~] nsxcli -c get intelligence flows stats ack
Tue Aug 27 2024 UTC 19:05:07.613
NSX Intelligence Host Flows Acknowledgement Statistics
--------------------------------------------------------------------------------
host uuid: a5e4fb86-5017-4eed-a057-ed935529d920
host type: nsx-esx(1)
Total Sent    Total Ack'ed    Last Sent    Last Ack'ed    Last Sent Time
511247        0               77           0              2024-08-27 19:01:07
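The acknowledgement counters can also be pulled out of this output in a script. The following is a sketch that assumes the output layout shown above (a header row followed by one numeric data row); the function name acked_flows is hypothetical.

```shell
# acked_flows
# Reads "get intelligence flows stats ack" output on stdin and prints
# "<total sent> <total acked>" taken from the first numeric data row.
acked_flows() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $1, $2; exit }'
}

# Example usage on the ESXi host:
#   nsxcli -c get intelligence flows stats ack | acked_flows
```

A large first number with a second number of 0, as in the sample above, means flows are being sent but none are acknowledged.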
-- Enable debug logging on the Kafka StatefulSet on NSX Application Platform. (Revert this change after the log bundle is collected.)
-- SSH to the NSX Manager.
a) napp-k edit sts kafka
b) Locate the existing JMX parameters:
- name: KAFKA_JMX_OPTS
  value: '-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.rmi.port=30001 '
and add one more parameter to the value: -Djavax.net.debug=ssl:handshake
c) Save and exit. This restarts the Kafka pods with SSL handshake debugging enabled.
-- The Kafka pod logs show entries similar to the following:
2024-09-10T14:58:07.51011165Z stderr F javax.net.ssl|FINE|34|data-plane-kafka-network-thread-0-ListenerName(EXTERNAL)-SSL-8|2024-09-10 14:58:07.509 GMT|CertificateMessage.java:372|Consuming client Certificate handshake message (
2024-09-10T14:58:07.510126826Z stderr F "Certificates": [
2024-09-10T14:58:07.510130112Z stderr F "certificate" : {
2024-09-10T14:58:07.510132713Z stderr F "version" : "v3",
2024-09-10T14:58:07.510135661Z stderr F "serial number" : "00 88 F5 7F D5 24 80 CE 11",
2024-09-10T14:58:07.510141803Z stderr F "signature algorithm": "SHA256withRSA",
2024-09-10T14:58:07.510146628Z stderr F "issuer" : "UID=a5e4fb86-5017-4eed-a057-ed935529d920, CN=VMware-NSX-Host, [email protected], O="VMware, Inc.", L=Palo Alto, ST=California, C=US",
2024-09-10T14:58:07.510148922Z stderr F "not before" : "2021-03-22 19:51:36.000 GMT",
2024-09-10T14:58:07.510151113Z stderr F "not after" : "2031-03-20 19:51:36.000 GMT",
2024-09-10T14:58:07.510153609Z stderr F "subject" : "UID=a5e4fb86-5017-4eed-a057-ed935529d920, CN=VMware-NSX-Host, [email protected], O="VMware, Inc.", L=Palo Alto, ST=California, C=US",
2024-09-10T14:58:07.510156181Z stderr F "subject public key" : "RSA",
2024-09-10T14:58:07.510158876Z stderr F "extensions" : [
2024-09-10T14:58:07.51016122Z stderr F {
...
...
2024-09-10T14:58:07.510717546Z stderr F javax.net.ssl|SEVERE|34|data-plane-kafka-network-thread-0-ListenerName(EXTERNAL)-SSL-8|2024-09-10 14:58:07.510 GMT|TransportContext.java:323|Fatal (CERTIFICATE_UNKNOWN): PKIX path validation failed: java.security.cert.CertPathValidatorException: signature check failed (
2024-09-10T14:58:07.510728452Z stderr F "throwable" : {
2024-09-10T14:58:07.510732413Z stderr F sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: signature check failed
2024-09-10T14:58:07.510735399Z stderr F at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:386)
2024-09-10T14:58:07.510738075Z stderr F at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:291)
2024-09-10T14:58:07.51074041Z stderr F at sun.security.validator.Validator.validate(Validator.java:271)
2024-09-10T14:58:07.510742757Z stderr F at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:315
-- Verify whether the host certificate seen in the Kafka log above matches the host certificate on the ESXi host. In this example the validity dates of the two certificates differ.
[root@ESXi:~] openssl x509 -noout -text -in /etc/vmware/nsx/host-cert.pem
[Output Truncated]
Validity
Not Before: Aug 14 17:04:26 2024 GMT
Not After : Nov 17 17:04:26 2026 GMT
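Besides the validity dates, the serial numbers can be compared. Java prints the serial as space-separated bytes with a leading 00, while openssl x509 -noout -serial prints serial=<hex>, so both need to be normalized first. The following is a sketch; the function name normalize_serial is hypothetical.

```shell
# normalize_serial TEXT
# Strips a leading "serial=", removes spaces and colons, upper-cases the hex
# digits, and drops leading zeros so serials from the Kafka log and from
# openssl can be compared as plain strings.
normalize_serial() {
    printf '%s\n' "$1" \
        | sed -e 's/^serial=//' -e 's/[ :]//g' \
        | tr 'a-f' 'A-F' \
        | sed -e 's/^0*//'
}

# Example usage on the ESXi host:
#   host=$(normalize_serial "$(openssl x509 -noout -serial -in /etc/vmware/nsx/host-cert.pem)")
#   kafka=$(normalize_serial '00 88 F5 7F D5 24 80 CE 11')
#   [ "$host" = "$kafka" ] && echo "serials match" || echo "serials differ"
```

If the serials differ, the certificate presented to Kafka is not the one currently stored on the host.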
Environment:
VMware NSX-T Data Center 3.x
VMware NSX-T Data Center 4.x
NSX Application Platform 4.1.2
NSX Application Platform 4.2.0
Cause:
This is a known issue affecting NSX Application Platform 4.1.2. It occurs when nsx-exporter on the ESXi host does not pick up the recently generated host certificate, resulting in a mismatch between the host certificate on the ESXi host and the host certificate seen by Kafka.
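Resolution:
-- Verify connectivity from the ESXi host to the NSX Application Platform ingress and messaging endpoints:
nc -zv <Ingress URL> 443
nc -zv <Messaging URL> 9092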
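The two connectivity checks above can be wrapped in a small helper that reports each endpoint on one line. The following is a sketch only; the function name check_endpoints is hypothetical, and the host names in the usage comment are placeholders for the actual ingress and messaging URLs.

```shell
# check_endpoints HOST:PORT...
# Runs "nc -zv HOST PORT" for each argument and prints OK or FAIL per endpoint.
# Returns non-zero if at least one endpoint is unreachable.
check_endpoints() {
    rc=0
    for ep in "$@"; do
        host=${ep%:*}     # text before the last colon
        port=${ep##*:}    # text after the last colon
        if nc -zv "$host" "$port" >/dev/null 2>&1; then
            echo "OK   $ep"
        else
            echo "FAIL $ep"
            rc=1
        fi
    done
    return "$rc"
}

# Example usage (placeholder host names):
#   check_endpoints ingress.example:443 messaging.example:9092
```

A successful TCP connection does not rule out this issue, because the failure described here happens later, during the SSL handshake.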
-- Restart the following services on the ESXi host, in this order:
1. /etc/init.d/nsx-exporter restart
2. /etc/init.d/nsx-opsagent restart
3. /etc/init.d/netopad restart