Metrics are not delivered to NSX+ after NSX Manager on 4.1.1 is onboarded to NSX+
Article ID: 345782
Products
VMware NSX Networking
Issue/Introduction
Symptoms:
- NSX Manager is on 4.1.1 and has been onboarded to NSX+
- The Manager is using a self-signed certificate
- The hostname configured on the NSX Manager node is "nsx-manager"
A transmission failure is observed in the logs. About 2 minutes after the Local Manager (LM) is onboarded to NSX+, an alarm is raised for metrics delivery failure. In the alarm description, the target address is the IP address of the Manager node on which the issue is observed.
Logging example in syslog on the Manager node: zgrep 'Failed to send one msg' /var/log/syslog*
20xx-xx-xxTxx:xx:xx.xxxZ nsx-manager NSX 255969 - [nsx@6876 comp="nsx-manager" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="tsdb-sender-metrics_mux"] Failed to send one msg timestamp: 1#012entity: POLICY_EDGE_NODE#012entity_id: "<UUID>"#012node_id: "<UUID>"#012nsx_site_id: "<UUID>"#012tenant_id: "<ID>"#012org_id: "<UUID>"#012system {#012 obj_id: "1"#012 mem_available: 1#012}#012:#012 <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1671532749.874223758","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671532749.874222022","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>#012 Traceback (most recent call last):#012 File "/opt/vmware/nsx-netopa/lib/python/sha/core/channel/provider/tsdb_provider.py", line 596, in send_metrics#012 response = self._metric_stub.MetricsUpdate(msg, timeout=transmit_timeout,#012 File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 946, in __call__#012 return _end_unary_response_blocking(state, call, False, None)#012 File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 849, in _end_unary_response_blocking#012 raise _InactiveRpcError(state)#012grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1671532749.874223758","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671532749.874222022","description":"failed to connect to all 
addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>
Another logging example from syslog: zgrep "overwrite_target is " /var/log/syslog*
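The syslog entry above is hard to read because control characters are written as '#' plus three octal digits (#012 for a newline, #011 for a tab). The following is a minimal sketch, not part of the product, that decodes those escapes and extracts the gRPC status from a "Failed to send one msg" entry. The function names are hypothetical, and the escape format is assumed to be the usual rsyslog one.

```python
import re

# Assumption: control characters are escaped rsyslog-style as '#' followed
# by three octal digits, e.g. '#012' (newline) and '#011' (tab).
_OCTAL_ESCAPE = re.compile(r"#(0[0-3][0-7])")


def decode_syslog_escapes(line: str) -> str:
    """Turn '#012' / '#011' style escapes back into real control characters."""
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


def summarize_send_failure(line: str) -> dict:
    """Extract the gRPC status and details from one syslog entry (hypothetical helper)."""
    text = decode_syslog_escapes(line)
    status = re.search(r"status = (StatusCode\.\w+)", text)
    details = re.search(r'details = "([^"]*)"', text)
    return {
        "is_send_failure": "Failed to send one msg" in text,
        "status": status.group(1) if status else None,
        "details": details.group(1) if details else None,
    }
```

For the entry shown above, the summary would report StatusCode.UNAVAILABLE with the details "failed to connect to all addresses".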
The transmission fails during the handshake with the Local Manager when the self-signed certificate is missing the X509v3 Subject Alternative Name.
This can be verified with the following steps:
1. Get the certificate ID from the NSX UI:
   a. Log in to the NSX Manager UI and navigate to System > Certificates.
   b. Locate the API certificate for the node ID used by the Manager in question. (Find the Manager node ID by running get nodes in the Manager CLI, or in the NSX UI at System → Appliances → NSX Manager → VIEW DETAILS → Copy the UUID.)
   c. Expand this certificate item and note its UUID.
2. Use API GET /api/v1/trust-management/certificates/[certificate ID from Step 1]
3. Copy the pem_encoded value from the above output to a text file. Note: After copying the content, remove the literal '\n' sequences from the string, and keep the header line (BEGIN CERTIFICATE) and the trailer line (END CERTIFICATE) each on its own line.
5. Check whether the DNS entry under the X509v3 Subject Alternative Name section is empty. If it is, the transmission failure is caused by the certificate:
X509v3 Subject Alternative Name: DNS <-----------------------empty
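Steps 2 and 3 above can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, and it assumes the API response is a JSON object with a pem_encoded field, as described in step 3. It replaces the '\n' sequences with real line breaks, which also leaves the header and trailer on their own lines.

```python
import json

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def certificate_path(certificate_id: str) -> str:
    """Build the GET path from step 2 for a given certificate ID."""
    return f"/api/v1/trust-management/certificates/{certificate_id}"


def normalize_pem(raw: str) -> str:
    """Return PEM text with the header and trailer on their own lines."""
    # Text pasted by hand may still contain a literal backslash followed
    # by 'n'; turn each of those into a real line break.
    text = raw.replace("\\n", "\n").replace("\r", "")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != PEM_HEADER or lines[-1] != PEM_FOOTER:
        raise ValueError("input does not look like PEM certificate text")
    return "\n".join(lines) + "\n"


def pem_from_response(response_body: str) -> str:
    """Extract pem_encoded from the JSON response body of the GET call."""
    # json.loads already converts JSON newline escapes into line breaks.
    return normalize_pem(json.loads(response_body)["pem_encoded"])
```

The normalized text can be saved to a file (for example cert.pem) and inspected with a tool such as openssl x509 -in cert.pem -noout -text, whose output includes an X509v3 Subject Alternative Name section when the extension is present.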
Resolution
This issue will be resolved in a later release, NSX 4.1.2.
Workaround:
1. Generate a new self-signed certificate using the path below and note the certificate ID: System → Certificates → Generate → Self Signed Certificate
Note: When generating the certificate, uncheck the Service Certificate option and set the Common Name to the IP address or FQDN of the Manager node to which the certificate will be applied in the next step. If using an FQDN, ensure that it can be resolved by checking that it is listed in the file /etc/hosts on every node in the topology.
2. Apply the new certificate using the following API to remove the reference to the former certificate
The Manager node ID is the node UUID of the NSX Manager node on which this issue was observed.
3. Verify that the new certificate is applied to the Manager node. Refresh the UI, return to the new certificate, and check that the Where Used value is now greater than zero and that the certificate is used by the Manager node.
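Two parts of the workaround above lend themselves to a short sketch: checking that an FQDN appears in /etc/hosts, and forming the request that applies the new certificate. The helper names are hypothetical. The API call for step 2 is not shown in this article, so the request below is an assumption based on the apply_certificate action in the NSX trust-management API; confirm it against the NSX API guide for your release before use.

```python
from urllib.parse import urlencode


def fqdn_in_hosts(hosts_text: str, fqdn: str) -> bool:
    """Return True if fqdn is listed as a host name in /etc/hosts content."""
    wanted = fqdn.lower().rstrip(".")
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        # fields[0] is the address; the remaining fields are host names.
        if any(name.lower().rstrip(".") == wanted for name in fields[1:]):
            return True
    return False


def apply_certificate_request(certificate_id: str, node_id: str) -> str:
    """Build the request line for applying a certificate to a Manager node."""
    # ASSUMPTION: action=apply_certificate with service_type=API and node_id
    # is not taken from this article; verify it in the NSX API guide.
    query = urlencode(
        {"action": "apply_certificate", "service_type": "API", "node_id": node_id}
    )
    return f"POST /api/v1/trust-management/certificates/{certificate_id}?{query}"
```

A possible use is to read /etc/hosts on each node, call fqdn_in_hosts with the FQDN chosen as the Common Name, and only then send the request built by apply_certificate_request with the new certificate ID and the Manager node UUID.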
Additional Information
Impact/Risks: Metrics cannot be delivered to NSX+. Some of the metrics on the NSX+ User Interface and API will not be available.