Metrics are not delivered to NSX+ after NSX Manager on 4.1.1 is onboarded to NSX+

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
- NSX Manager is running version 4.1.1 and has been onboarded to NSX+
- The Manager is using a self-signed certificate
- The hostname configured on the NSX Manager node is "nsx-manager"

A transmission failure is observed in the logs. About 2 minutes after the Local Manager (LM) is onboarded to NSX+, an alarm is raised for a metrics delivery failure. In the alarm description, the target address is the IP address of the Manager node on which the issue is observed.

Logging example in syslog on the Manager node:
zgrep 'Failed to send one msg' /var/log/syslog*

20xx-xx-xxTxx:xx:xx.xxxZ nsx-manager NSX 255969 - [nsx@6876 comp="nsx-manager" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="tsdb-sender-metrics_mux"] Failed to send one msg timestamp: 1#012entity: POLICY_EDGE_NODE#012entity_id: "<UUID>"#012node_id: "<UUID>"#012nsx_site_id: "<UUID>"#012tenant_id: "<ID>"#012org_id: "<UUID>"#012system {#012 obj_id: "1"#012 mem_available: 1#012}#012:#012 <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1671532749.874223758","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671532749.874222022","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>#012 Traceback (most recent call last):#012 File "/opt/vmware/nsx-netopa/lib/python/sha/core/channel/provider/tsdb_provider.py", line 596, in send_metrics#012 response = self._metric_stub.MetricsUpdate(msg, timeout=transmit_timeout,#012 File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 946, in __call__#012 return _end_unary_response_blocking(state, call, False, None)#012 File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 849, in _end_unary_response_blocking#012 raise _InactiveRpcError(state)#012grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1671532749.874223758","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671532749.874222022","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>

Another logging example from syslog:
zgrep "overwrite_target is " /var/log/syslog*

20xx-xx-xxTxx:xx:xx.xxxZ nsx-manager NSX 2339127 - [nsx@6876 comp="nsx-manager" subcomp="ip_utils" username="nsx-sha" level="INFO" s2comp="tsdb-cert"] overwrite_target is None
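
The first syslog entry above is hard to read because control characters are written as '#' followed by a three-digit octal code (#012 for a newline, #011 for a tab). The following is a minimal Python sketch, not an NSX tool, that expands those codes and extracts the gRPC status name. The function names are hypothetical, and the sample string is a shortened copy of the entry above.

```python
import re

# Control characters appear in the log as '#' plus a three-digit octal code,
# for example #012 (newline) and #011 (tab).
_OCTAL_ESCAPE = re.compile(r"#([0-3][0-7]{2})")


def decode_syslog_escapes(line: str) -> str:
    """Replace #NNN octal escapes with the characters they stand for."""
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


def grpc_status(line: str):
    """Return the gRPC status name found in a log line, or None if absent."""
    match = re.search(r"status = StatusCode\.(\w+)", line)
    return match.group(1) if match else None


# Shortened copy of the log entry shown in this article.
sample = (
    "Failed to send one msg timestamp: 1#012entity: POLICY_EDGE_NODE"
    "#012#011status = StatusCode.UNAVAILABLE"
    '#012#011details = "failed to connect to all addresses"'
)

print(decode_syslog_escapes(sample))
print(grpc_status(sample))
```

A status of UNAVAILABLE together with "failed to connect to all addresses" matches the symptom described in this article.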

Cause

Transmission fails during the handshake with the Local Manager when the self-signed certificate has no usable X509v3 Subject Alternative Name, that is, when the DNS entry in that extension is empty.

This can be verified with the following steps:

1. Get the certificate ID from the NSX UI:
a. Log in to the NSX Manager UI and navigate to System → Certificates
b. Locate the API certificate for the node ID of the Manager in question (find the Manager node ID by running get nodes in the Manager CLI, or in the NSX UI at System → Appliances → NSX Manager → VIEW DETAILS → Copy the UUID)
c. Expand this certificate item and note its UUID

2. Use API GET /api/v1/trust-management/certificates/[certificate ID from Step 1]

curl -k -i -H "Accept: application/json" -u admin -X GET https://<Manager IP>/api/v1/trust-management/certificates/<certificate ID>

{
"pem_encoded": "-----BEGIN CERTIFICATE-----\nMIIrdzCCAl+gAwIBAgIJAP/fpd0dacyuMA0GCSqGSIb3DQEBCwUAMGgxFDASBgNV\nBAMMC25zeC1tYW5hZ2VyMQwwCgYDVQQLDANOU1gxFDASBgNVBAoMC1ZNd2FyZSBJ\nbmMuMRIwEAYDVQQHDAlQYWxvIEFsdG8xCzAJBgNVBAgMAkNBMQswCQYDVQQGEwJV\nUzAWerSWQertQxMTM2MzVaFw0yNTEwMjgxMTM2MzVaMGgxFDASBgNVBAMMC25z\neC1tYW5hZ2VyMQwwCgYDVQQLDANOU1gxFDASBgNVBAoMC1ZNd2FyZSBJbmMuMRIw\nEAYDVQQHDAlQYWxvIEFsdG8xCzAJBgNVBAgMAkNBMQswCQYDVQQGEwJVUzCCASIw\nDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAMJbSP3uVlVmkyp0pR95+ppbuTaL\ncvxMgCxvgTQ0LifTJFX3wraatPXgwBo4r8cpXeP+aLn/KxR8jTWnXbZyK+4ssaW9\n/X5tFYabn0TNyQl6aPO03mmJNLZOKUcIXXP7DKtkt6TEaWH1X4C45gxteXponbc9\nCFnVmArci0pkkBFng+l9fASu35P4LuBkHmFbspOA23JNmCCTtvW0n+Ry0NqP6mw8\nbRqAkymlQI6q2aVPcPUChmptdNqx1gEWXnGaxlfK2hu6tvTnC4jeTirG5Yepv2yP2n5DTGog0GPHP1k9f7bQkNwDkQ7YlvC3AvJUt/b4b3WMeWlnHhEMyFEwMmDaECAwEA\nAaMkMCIwEwYDVR0lBAwwCgYIKwYBBQUHAwEwCwYDVR0RBAQwAoIAMA0GCSqGSIb3\nDQEBCwUAA4IBAQBQq/XN1HkYAnENYkxlwjuzlxzDkYsnr82E7PEVyJ5yP4m6sF85\nzz5FsCt7Y6kHt+2xrgNi0UHuSByvYxtkOzBTOqqLoDltBEng+HOTW2Cd6zD+xvHL\nxg41K6ykfMvjBc+wb+h2JQfUiL8yh10g4Uvpv1HKtCCUJb00kLRK4TIm5+KHtIB4\nF8uWHtwBnz93G0PO1/K89gybQgy+WitjM0NYExytIiLcWVETQc9rVd2ubLxxExJ3\nBxQsEMOviB6I6KjCmDtk69vOSvrZGXxUBQhve3BQku44jWVUg5AWJRZKm9sRMAHu\nOyE2ycIrToxsFuiwpWOzyTMReq2NQIuh0F2Q\n-----END CERTIFICATE-----\n",
"has_private_key": true,
"used_by": [
{
"node_id": "<UUID>",
"service_types": [
"API"
]
}],
"leaf_certificate_sha_256_thumbprint": "15:B5:93:F0:35:77:91:3B:22:B6:D3:24:6F:F1:9D:15:DE:4E:D3:C4:EB:51:2D:D2:0D:66:D1:65:2B:7F:18:BE",
"resource_type": "certificate_self_signed",
"id": "<UUID>",
"display_name": "API certificate for node <UUID>",
"_create_time": 1690371634010,
"_create_user": "system",
"_last_modified_time": 1690377045873,
"_last_modified_user": "admin",
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 2
}

3. Copy the pem_encoded value from the above output to a text file, for example /tmp/ca.pem.
Note: After copying the content, remove each literal '\n' sequence from the string and keep the BEGIN CERTIFICATE header and the END CERTIFICATE footer on their own lines.

For example:
cat /tmp/ca.pem

-----BEGIN CERTIFICATE-----
MIIrdzCCAl+gAwIBAgIJAP/fpd0dacyuMA0GCSqGSIb3DQEBCwUAMGgxFDASBgNVBAMMC25zeC1tYW5hZ2VyMQwwCgYDVQQLDANOU1gxFDASBgNVBAoMCQWer2FyZSBJbmMuMRIwEAYDVQQHDAlQYWxvIEFsdG8xCzAJBgNVBAgMAkNBMQswCQYDVQQGEwJVUzAeFw0yMzA3MjYxMTM2MzVaFw0yNTEwMjgxMTM2MzVaMGgxFDASBgNVBAMMC25zeC1tYW5hZ2VyMQwwCgYDVQQLDANOU1gxFDASBgNVBAoMC1ZNd2FyZSBJbmMuMRIwEAYDVQQHDAlQYWxvIEFsdG8xCzAJBgNVBAgMAkNBMQswCQYDVQQGEwJVUzCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAMJbSP3uVlVmkyp0pR95+ppbuTaLcvxMgCxvgTQ0LifTJFX3wraatPXgwBo4r8cpXeP+aLn/KRER8jTWnXbZyK+4ssaW9/X5tFYabn0TNyQl6aPO03mmJNLZOKUcIXXP7DKtkt6TEaWH1X4C45gxteXponbc9CFnVmArci0pkkBFng+l9fASu35P4LuBkHmFbspOA23JNmCCTtvW0n+Ry0NqP6mw8bRqAkymlQI6q2aVPcPUChmptdNqx1gEWXnGaxlfK2hu6tvTnC4jeTirG5Yepv2yP5DTGog0GPHP1k9f7bQkNwDkQ7YlvC3AvJUt/b4b3WMeWlnHhEMyFEwMmDaECAwEAAaMkMCIwaeYDVR0lBAwwCgYIKwYBBQUHAwEwCwYDVR0RBAQwAoIAMA0GCSqGSIb3DQEBCwUAA4IBAQBQq/XN1HkYAnENYkxlwjuzlxzDkYsnr82E7PEVyJ5yP4m6sF85zz5FsCt7Y6kHt+2xrgNi0UHuSByvYxtkOzBTOqqLoDltBEng+HOmeKCd6zD+xvHLxg41K6ykfMvjBc+wb+h2JQfUiL8yh10g4Uvpv1HKtCCUJb00kLRK4TIm5+KHtIB4F8uWHtwBnz93G0PO1/K89gybQgy+WitjM0NYExytIiLcWVETQc9rVd2ubLxxExJ3BxQsEMOviB6I6KjCmDtk69vOSvrZGXxUBQhve3BQku44jWVUg5AWJRZKm9sRMAHuOyE2ycIrToxsFuiwpWOzyTMReq2NQIuh0F2Q
-----END CERTIFICATE-----

4. Use openssl x509 to decode the certificate file from step 3:
openssl x509 -in /tmp/ca.pem -text -noout

Certificate:
Data:
Version: 3 (0x2)
Serial Number:
<serial number>
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN=nsx-manager, OU=NSX, O=VMware Inc., L=Palo Alto, ST=CA, C=US
Validity
Not Before: Jul 26 11:36:35 2023 GMT
Not After : Oct 28 11:36:35 2025 GMT
Subject: CN=nsx-manager, OU=NSX, O=VMware Inc., L=Palo Alto, ST=CA, C=US
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
00:c2:5b:48:fd:ee:56:55:66:93:2a:74:a5:1f:79:
fa:9a:5b:b9:36:8b:72:fc:4c:80:2c:6f:81:34:34:
2e:27:d3:24:55:f7:c2:b6:9a:b4:f5:e0:c0:1a:38:
af:c7:29:5d:e3:fe:68:b9:ff:2b:14:7c:8d:35:a7:
5d:b6:72:2b:ee:2c:b1:a5:bd:fd:7e:6d:15:86:9b:
9f:44:cd:c9:09:7a:68:f3:b4:de:69:89:34:b6:4e:
29:47:08:5d:73:fb:0c:ab:64:b7:a4:c4:69:61:f5:
5f:80:b8:e6:0c:6d:79:7a:68:9d:b7:3d:08:59:d5:
98:0a:dc:8b:4a:64:90:11:67:83:e9:7d:7c:04:ae:
df:93:f8:2e:e0:64:1e:61:5b:b2:93:80:db:72:4d:
98:20:93:b6:f5:b4:9f:e4:72:d0:da:8f:ea:6c:3c:
6d:1a:80:93:29:a5:40:8e:aa:d9:a5:4f:70:f5:02:
86:6a:6d:74:da:b1:d6:01:16:5e:71:9a:c6:57:ca:
da:1b:ba:b6:f4:e7:0b:88:de:4e:2a:c6:e5:87:a9:
bf:6c:8f:e4:34:c6:a2:0d:06:3c:73:f5:93:d7:fb:
6d:09:0d:c0:39:10:ed:89:6f:0b:70:2f:25:4b:7f:
6f:86:f7:53:c7:96:96:71:e1:10:cc:85:13:03:26:
0d:a1
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Extended Key Usage:
TLS Web Server Authentication
X509v3 Subject Alternative Name:
DNS
Signature Algorithm: sha256WithRSAEncryption
50:ab:f5:cd:d4:79:18:02:71:0d:62:4c:65:c2:3b:b3:97:1c:
c3:91:8b:27:af:cd:84:ec:f1:15:c8:9e:72:3f:89:ba:b0:5f:
39:cf:3e:45:b0:2b:7b:63:a9:07:b7:ed:b1:ae:03:62:d1:41:
ee:48:1c:af:63:1b:64:3b:30:53:3a:aa:8b:a0:39:6d:04:49:
e0:f8:73:a6:78:a0:9d:eb:30:fe:c6:f1:cb:c6:0e:35:2b:ac:
a4:7c:cb:e3:05:cf:b0:6f:e8:76:25:07:d4:88:bf:32:87:5d:
20:e1:4b:e9:bf:51:ca:b4:20:94:25:bd:34:90:b4:4a:e1:32:
26:e7:e2:87:b4:80:78:17:cb:96:1e:dc:01:9f:3f:77:1b:43:
ce:d7:f2:bc:f6:0c:9b:42:0c:be:5a:2b:63:33:43:58:13:1c:
ad:22:22:dc:59:51:13:42:cf:6b:55:dd:ae:6c:bc:71:13:12:
77:07:14:2c:10:c3:af:88:1e:88:e8:a8:c2:98:3b:64:eb:db:
ce:4a:fa:d9:19:7c:54:05:08:6f:7b:70:50:92:ee:38:8d:65:
54:83:90:16:25:16:4a:9b:db:11:30:01:ee:3b:21:36:c9:c2:
2b:4e:8c:6c:16:e8:b0:a5:63:b3:c9:33:11:7a:ad:8d:40:8b:
a1:d0:5d:90

5. Check whether the DNS entry under the X509v3 Subject Alternative Name section of the output above is empty. If it is, the transmission failure is caused by the certificate:
X509v3 Subject Alternative Name:
DNS <-----------------------empty
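
Steps 3 to 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported tool: the function names are hypothetical, pem_from_api_response expects only the JSON body from step 2 (run curl without -i so the HTTP headers are not included), and has_empty_dns_san expects the text printed by openssl x509 -text -noout in step 4. Parsing the JSON turns the '\n' sequences into real line breaks, which yields a standard multi-line PEM that openssl also accepts.

```python
import json


def pem_from_api_response(response_body: str) -> str:
    """Return the pem_encoded field of the step 2 JSON body.

    json.loads converts the escaped \\n sequences inside the JSON string
    into real newlines, so no manual editing of the text is needed.
    """
    return json.loads(response_body)["pem_encoded"]


def san_entries(openssl_text: str):
    """Return the entries listed under 'X509v3 Subject Alternative Name'.

    Returns None when the extension does not appear in the output at all.
    """
    lines = openssl_text.splitlines()
    for i, line in enumerate(lines):
        if "X509v3 Subject Alternative Name" in line:
            value = lines[i + 1].strip() if i + 1 < len(lines) else ""
            return [e.strip() for e in value.split(",") if e.strip()]
    return None


def has_empty_dns_san(openssl_text: str) -> bool:
    """True when the SAN extension contains a DNS entry with no name."""
    entries = san_entries(openssl_text)
    if entries is None:
        return False
    dns = [e for e in entries if e.startswith("DNS")]
    # "DNS" or "DNS:" with nothing after it is the failing case in this article.
    return any(e.partition(":")[2].strip() == "" for e in dns)


# Extract of the step 4 output shown in this article.
sample_output = """X509v3 extensions:
    X509v3 Extended Key Usage:
        TLS Web Server Authentication
    X509v3 Subject Alternative Name:
        DNS
"""
print(has_empty_dns_san(sample_output))  # prints True
```

A result of True corresponds to the empty DNS entry highlighted in step 5.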

Resolution

This issue will be resolved in NSX 4.1.2.

Workaround:
1. Generate a new self-signed certificate as follows and note its certificate ID:
System → Certificates → Generate → Self Signed Certificate

Note: When generating the certificate, uncheck the Service Certificate option and set the Common Name to the IP address or FQDN of the Manager node to which the certificate will be applied in the next step. If you use the FQDN, ensure it can be resolved by checking that it is included in the file /etc/hosts on every node in the topology.

2. Apply the new certificate with the following API call, which also removes the reference to the former certificate:

curl -k -i -H "Accept: application/json" -u admin -X POST https://<Manager IP>/api/v1/trust-management/certificates/<certificate ID>?action=apply_certificate\&service_type=API\&node_id=<Manager node ID>

The Manager node ID is the UUID of the NSX Manager node on which this issue was observed.

3. Verify that the new certificate is applied to the Manager node. Refresh the UI, return to the new certificate, and check that the value of the Where Used field is now greater than zero and that the certificate is used by the Manager node.
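
The API call in step 2 of the workaround can also be scripted. The sketch below is a rough Python equivalent of the curl command, assuming basic authentication with the admin account. The function names are hypothetical, and certificate verification is disabled to mirror curl -k. Building the query string with urlencode avoids the shell escaping of '&' that the curl command needs.

```python
import base64
import ssl
import urllib.parse
import urllib.request


def apply_certificate_url(manager: str, certificate_id: str, node_id: str) -> str:
    """Build the apply_certificate URL from step 2 with encoded parameters."""
    query = urllib.parse.urlencode(
        {"action": "apply_certificate", "service_type": "API", "node_id": node_id}
    )
    path = "/api/v1/trust-management/certificates/" + urllib.parse.quote(
        certificate_id, safe=""
    )
    return f"https://{manager}{path}?{query}"


def apply_certificate(manager, certificate_id, node_id, username, password):
    """Send the POST request and return the HTTP status code.

    Certificate verification is turned off, as with `curl -k`.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        apply_certificate_url(manager, certificate_id, node_id),
        method="POST",
        headers={"Accept": "application/json", "Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, context=context) as response:
        return response.status


# Placeholder values for illustration only; nothing is sent here.
print(apply_certificate_url("192.0.2.10", "<certificate ID>", "<Manager node ID>"))
```

Only apply_certificate contacts the Manager. Printing the URL first is a convenient way to confirm the certificate ID and node ID before sending the request.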

Additional Information

Impact/Risks:
Metrics cannot be delivered to NSX+. Some of the metrics on the NSX+ User Interface and API will not be available.