In a VMware NSX environment with NAPP deployed: Failed to deliver metrics from SHA to target NAPP

Article ID: 316739

Updated On: 07-01-2024

Products

VMware NSX

Issue/Introduction

Symptoms:
  • You recently deployed NSX Application Platform (NAPP) version 4.1.1 or upgraded NAPP to version 4.1.1.
  • The VMware NSX UI shows 'Metrics Delivery Failure' alarms for ESXi transport nodes under Alarms.
  • On the ESXi host named in the alarm, /var/run/log/nsx-syslog contains the following warning:
Wa(180) nsx-sha: NSX 2315153 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="WARNING" s2comp="tsdb-sender-napp"] Failed to send one msg node_id: "xxxxxxxx-8635-4487-88bf-xxxxxxxxxxxx"
Wa(180)[+] nsx-sha: timestamp: 1693143034
Wa(180)[+] nsx-sha: health_check_poll: false
Wa(180)[+] nsx-sha: :
Wa(180)[+] nsx-sha: <_InactiveRpcError of RPC that terminated with:
Wa(180)[+] nsx-sha: status = StatusCode.UNAUTHENTICATED
Wa(180)[+] nsx-sha: details = ""
Wa(180)[+] nsx-sha: debug_error_string = "{"created":"@xxxxxxxxxx.480732227","description":"Error received from peer ipv4:192.168.1.1:443","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"","grpc_status":16}"
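
To confirm that a host is hitting this symptom, the log can be searched for the two markers shown above: the tsdb-sender-napp "Failed to send one msg" line and the StatusCode.UNAUTHENTICATED status. The following is a minimal sketch; the function name count_sha_auth_failures is hypothetical, and the default path is the one named in this article.

```shell
#!/bin/sh
# Hypothetical helper: count SHA-to-NAPP delivery failures in a log file.
# Argument 1 (optional): log file path; defaults to the path named in the article.
count_sha_auth_failures() {
    log_file="${1:-/var/run/log/nsx-syslog}"
    # grep -c prints the number of matching lines; "|| true" keeps a zero
    # count (grep exit status 1) from aborting scripts that use "set -e".
    failed=$(grep -c 's2comp="tsdb-sender-napp".*Failed to send one msg' "$log_file" || true)
    # -F treats the pattern as a fixed string, so the dot is matched literally.
    unauth=$(grep -cF 'StatusCode.UNAUTHENTICATED' "$log_file" || true)
    echo "failed_sends=${failed} unauthenticated=${unauth}"
}
```

Run count_sha_auth_failures with no argument on the ESXi host to check the live log. Non-zero values for both counters are consistent with the authentication failure described here.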


 


Environment

VMware NSX 4.1.0.2

Cause

Issue 1. NSX_UA_TN is missing in authserver. The NAPP authentication server has no entity ID for the transport node type (in this case, an ESXi host). Without that ID the metrics service cannot authenticate, so metric delivery fails.


Issue 2. An Edge or ESXi node was added to NSX after NAPP was deployed.


Issue 3. The API certificate on NSX Manager changed after NAPP was deployed.

Resolution

This is a known issue impacting VMware NSX NAPP.

Workaround:

Steps to validate that the certs are in sync between NAPP, the TN or Edge node, and NSX Manager:

NAPP:

Authserver : 

After a restart, authserver logs the certs it loaded from trust manager. Grep the pod logs to list them.
Example commands:

      napp-k logs authserver-<podname> | grep "NSX_UA_TN"
      napp-k logs authserver-<podname> | grep "NSX_UA_EDGE"

Trust-manager : 

Query the certs present in trust manager with the following API. The result of the GET call should include a cert of type NSX_UA_TN / NSX_UA_EDGE. In the result, the alias field is the UUID of the TN/Edge node, which you can obtain by running get node-uuid on the node.

GET  https://<NSX_MANAGER_IP>/napp/api/v1/platform/trust-management/certificates

Example TN node cert returned by the trust manager GET certs API call:

        {
            "uuid": "45040503-xxxx-xxxx-xxxx-xxxxxxxx",
            "alias": "0af35bd1-xxxx-xxxx-xxxx-xxxxxxxx",
            "pem_encoded": "-----BEGIN CERTIFICATE-----\nMIIEEDCCAvgCCQC...sFADCXXXXXXshCSk\n-----END CERTIFICATE-----",
            "used_by": "NSX_UA_TN"
        },

TN / Edge node:

Use the following command on the TN/Edge node to display the host cert. It must match the cert held for this node in NAPP trust manager.

cat /etc/vmware/nsx/host-cert.pem

Get the UUID of the TN/Edge node with the CLI command get node-uuid. It should match the alias of that node in the result of the trust manager GET certs API call.

Example :

In TN node : 
>>  /bin/nsxcli -c get node-uuid

In Edge node : 
>> su admin 
>> get node-uuid
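
Comparing the host cert with the pem_encoded value from the API by eye is error-prone, because the API returns the cert as one JSON string with escaped "\n" sequences while the file on the node uses real line breaks. The following is a minimal sketch of a comparison that ignores those differences; the function names pem_body and certs_match are hypothetical, and the sketch assumes both inputs are single PEM certificates.

```shell
#!/bin/sh
# Hypothetical helper: reduce a PEM string to its bare base64 payload so that
# two copies can be compared regardless of line wrapping or JSON-escaped "\n".
pem_body() {
    printf '%s' "$1" \
        | sed -e 's/\\n//g' -e 's/-----[A-Z ]*-----//g' \
        | tr -d ' \n\r\t'
}

# Returns 0 (success) when both PEM strings carry the same payload.
certs_match() {
    [ "$(pem_body "$1")" = "$(pem_body "$2")" ]
}
```

For example, on the TN/Edge node, after copying the pem_encoded value for the matching alias into a shell variable named api_pem (a placeholder name): certs_match "$(cat /etc/vmware/nsx/host-cert.pem)" "$api_pem" && echo "in sync"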
 

NSX manager side : 

Query the TN certs present in NSX Manager with the following API.

GET "https://<NSX_MANAGER_IP>/api/v1/messaging/clients"

 

Common agent is the service that pushes the certs into trust manager on the NAPP side. To check whether common agent has synced properly, review /var/log/proton/nsxapi.log around the time the TN/Edge node was added.
Steps to identify the leader node of common agent service: 

Determine which of the 3 manager nodes holds the common agent leadership role. The following command shows the node ID of the common agent leader.
         1.  su admin -c get clus stat verb | grep "COMMON_AGENT_SERVICE" 
         2. To map that node ID to a manager node:   su admin -c get clus stat

 


Issue 1 - NSX_UA_TN is missing in authserver:

On NAPP side :

        •  SSH to the NSX manager as root and run the below:
                    napp-k edit deployment authserver
        • Then search for the below line:
                    --trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE
        • Edit the line and add the missing entity ID:
                    --trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN


        Note: There should be 3 entity IDs: NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN
        

        • Restart authserver:
                    napp-k -n nsxi-platform edit deployment authserver 

(If the napp-k alias is not functional, you can point kubectl directly at the Kubernetes config file, for example: kubectl --kubeconfig /config/vmware/napps/.kube/config)

            Once the edited deployment is saved, authserver restarts and syncs the entity certificates from trust manager, which should resolve the authentication issue on the NAPP side. To validate, check that the certs in trust manager and authserver are in sync. 
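
To check the entity list before and after the edit, the --trustmanager-entities line can be tested for the three required IDs. The following is a minimal sketch; the function name missing_entities is hypothetical, and it only inspects the text passed to it.

```shell
#!/bin/sh
# Hypothetical helper: print the entity IDs that are absent from a
# --trustmanager-entities line. Prints an empty line when all three are present.
missing_entities() {
    line="$1"
    missing=""
    for entity in NSX_UA_SVM NSX_UA_EDGE NSX_UA_TN; do
        case "$line" in
            *"$entity"*) ;;                                    # entity found
            *) missing="${missing}${missing:+ }${entity}" ;;   # append, space-separated
        esac
    done
    printf '%s\n' "$missing"
}
```

For example, missing_entities "--trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE" prints NSX_UA_TN. Assuming napp-k wraps kubectl as the note above suggests, the live value could be fed in with: missing_entities "$(napp-k get deployment authserver -o yaml | grep trustmanager-entities)"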

 

Issue 2 - An Edge or ESXi node was added to NSX after NAPP deployment:

Validate that the Edge/TN node cert (/etc/vmware/nsx/host-cert.pem) is present in the NAPP trust manager certs (GET certs API). Once that is confirmed, restart the authserver pod so that authserver reloads the certs. 

 

Issue 3 - API cert on NSX Manager has changed after the NAPP deployment:

      • Check which certificate the SHA agent on NSX Manager is using:
        • Get the root certificate and node certificate by searching syslog with the command `zgrep nsx-sha /var/log/syslog* | grep "NAPP Profile" `.
        • The certificate may be missing from the log if the connection log entries have already been rotated. In that case, restart the SHA agent as described in the steps below.
      • Get the API certificate:
        • Get the certificate ID from the NSX UI: 
          a. Log in to the NSX Manager UI and navigate to System > Certificates.
          b. Locate the API certificate for the node ID of the manager in question (find the Manager node ID by running get nodes in the Manager CLI, or in the NSX UI at System → Appliances → NSX Manager → VIEW DETAILS → Copy the UUID).
          c. Expand this certificate item and note its UUID.
        • Call the API GET /api/v1/trust-management/certificates/[certificate ID from last step]:
          curl -k -i -H "Accept: application/json" -u admin -X GET https://<Manager IP>/api/v1/trust-management/certificates/<certificate ID>
          {
              "pem_encoded": "-----BEGIN CERTIFICATE-----XXXXXXXXXXXX-----END CERTIFICATE-----\n",
              "has_private_key": true,
              "used_by": [
                          {
                              "node_id": "<UUID>",
                              "service_types": [
                                  "API"
                              ]
                          }],
              "leaf_certificate_sha_256_thumbprint": "XX:XX:XX:XX:XX:XX",
              "resource_type": "certificate_self_signed",
              "id": "<UUID>",
              "display_name": "API certificate for node <UUID>",
              "_create_time": 1690371634010,
              "_create_user": "system",
              "_last_modified_time": 1690377045873,
              "_last_modified_user": "admin",
              "_system_owned": false,
              "_protection": "NOT_PROTECTED",
              "_revision": 2
          }
      • If the node certificate differs from the updated API certificate, restart the SHA agent with the command `service nsx-sha restart`.
      • If restarting the SHA agent does not help, restart proton with the command `service proton restart`.

 

The last remediation step is to restart proton on the common agent leader node, which forces a full sync. Allow a few minutes for common agent to complete the full sync, then restart authserver on the NAPP side. This resyncs the certs from trust manager and resolves the issue. 

Restart proton: systemctl restart proton  
Restart authserver on the NAPP side: napp-k delete pod  authserver-<podname>