Metrics Delivery Failure alarm related to Security Services Platform(SSP) is seen on the NSX UI


Article ID: 389600


Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Purpose: Indicates continuous transmission failure of metrics to Security Services Platform (SSP)
Impact: Dashboard and API will not show the metrics from these nodes
Maintenance window required for remediation? No

Environment

vDefend SSP >= 5.0

Cause

There is a problem delivering metrics to SSP

Resolution

Security Services Platform (SSP) should be deployed

        • Get the status code for the delivery failure. It is included in the alarm description.
        • Known issues and remediation based on status code:
          • UNAUTHENTICATED
            • Steps to validate that the certificates are in sync between SSP, the TN/Edge node, and NSX:

              SSP:
              Check the certs added to authserver from trust manager
                  Log into SSPI root shell
                  k get pods -n nsxi-platform | grep authserver

                  k logs authserver-<podname> -n nsxi-platform | grep "NSX_UA_TN"
                  k logs authserver-<podname> -n nsxi-platform | grep "NSX_UA_EDGE"

              TN / Edge node:
              Use the following command on the TN/Edge node to get the host certificate. It must match the certificate held by the SSP trust manager.

              cat /etc/vmware/nsx/host-cert.pem

              Get the UUID of the TN/Edge node with the CLI command get node-uuid. It should match the alias of the TN/Edge node returned by the trust-manager get certs API call.

              Example :

              On TN node : 
              >>  /bin/nsxcli -c get node-uuid

              On Edge node : 
              >> su admin 
              >> get node-uuid

              Trust-manager : 

              From the previous step, we have the UUID of the TN/Edge node.

              In the SSP UI, navigate to System → Certificates and find the entry whose Name column has the format NSX_UA_TN <NODE_UUID> or NSX_UA_EDGE <NODE_UUID>.

              Export that certificate and verify that it matches the host-cert.pem certificate retrieved from the TN/Edge node in the previous step.
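The comparison between the exported certificate and host-cert.pem can be scripted. The following is a minimal sketch, not a product tool: the function names and the file name ssp-exported.pem are hypothetical. It compares only the Base64 body of the certificates, so differences in line wrapping or line endings between the two files are ignored.

```shell
# Hypothetical helper functions; not part of NSX or SSP.

# Print the Base64 body of every certificate in a PEM file as a single line,
# dropping the BEGIN/END marker lines, whitespace, and carriage returns.
pem_body() {
    sed -n '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p' "$1" \
        | grep -v -e '-----' \
        | tr -d ' \t\r\n'
}

# Succeed (exit 0) when both files carry the same non-empty certificate body.
pems_match() {
    [ -n "$(pem_body "$1")" ] && [ "$(pem_body "$1")" = "$(pem_body "$2")" ]
}

# Usage (ssp-exported.pem is an example name for the certificate exported from the SSP UI):
#   pems_match /etc/vmware/nsx/host-cert.pem ./ssp-exported.pem && echo MATCH || echo MISMATCH
```

If either file contains a certificate chain rather than a single certificate, the whole chain is compared, so extract the leaf certificate first in that case.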




              The following command shows the certificates used by the SHA agent:

              TN:

              /usr/lib/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp

              {"type": "napp", "status": {"10.xx.xx.xx:443": {"UPM Profile": "Received", "Global Config": "Enabled", "Metric Stub": "Created", "ingress_certificate": "-----BEGIN CERTIFICATE-----......-----END CERTIFICATE-----\n", "certificate_chain": "-----BEGIN CERTIFICATE-----......-----END CERTIFICATE-----\n"}}}

              Edge:

              /opt/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp

              {"type": "napp", "status": {"10.xx.xx.xx:443": {"UPM Profile": "Received", "Global Config": "Enabled", "Metric Stub": "Created", "ingress_certificate": "-----BEGIN CERTIFICATE-----......-----END CERTIFICATE-----\n", "certificate_chain": "-----BEGIN CERTIFICATE-----......-----END CERTIFICATE-----\n"}}}
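To inspect the certificate that the SHA agent reports, the escaped PEM string has to be pulled out of the JSON first. The following is a minimal sketch, assuming the output is single-line JSON shaped like the samples above, with certificate newlines escaped as \n. The function name extract_ingress_cert is hypothetical, and the text-based parsing is a convenience for nodes without a JSON tool, not a general JSON parser.

```shell
# Hypothetical helper; reads sha-appctl collector-status JSON on stdin and
# prints the value of "ingress_certificate" as multi-line PEM text.
extract_ingress_cert() {
    local escaped
    # Capture everything between the quotes that follow "ingress_certificate":
    escaped=$(sed -n 's/.*"ingress_certificate": *"\([^"]*\)".*/\1/p')
    [ -n "$escaped" ] || return 1
    # %b turns the literal \n sequences into real newlines.
    printf '%b' "$escaped"
}

# Usage on an ESXi TN (the output file name is an example):
#   /usr/lib/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp \
#       | extract_ingress_cert > /tmp/sha-ingress-cert.pem
```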

              NSX manager: 
              We can query the TN certs present in NSX manager using the following API. 

              GET "https://<NSX_MANAGER_IP>/api/v1/messaging/clients"

               

              The common agent is the service that pushes the certificates into the trust manager on the SSP side. To check whether the common agent has synced properly, review /var/log/proton/nsxapi.log around the time the TN/Edge node was added.
              Steps to identify the leader node of the common agent service:

              Determine which of the 3 manager nodes holds the common agent leadership role. The following commands identify the common agent leader by node ID:
                       1.  su admin -c get clus stat verb | grep "COMMON_AGENT_SERVICE" 
                       2. To map the node ID to a manager node:   su admin -c get clus stat

               

              Issue 1 - NSX_UA_TN is missing in authserver:

              On SSP side :
               
              Log into SSPI root shell

              k edit deployment authserver -n nsxi-platform

              Then search for the below line:
                                  --trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE
              Edit the line and add the missing entity ID:
                                  --trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN
                      

              After this change, the authserver should restart and sync the entity certificates from the trust manager, which should resolve the authentication issue on the SSP side. To validate, check that the certificates in the trust manager and the authserver are in sync.
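Before editing the deployment, it can help to confirm which entity IDs are actually missing. The following is a minimal sketch: the function name missing_entities is hypothetical, and it assumes the argument line looks like the one shown above (a comma-separated list after --trustmanager-entities=).

```shell
# Hypothetical helper; prints each required entity ID that is absent from
# the --trustmanager-entities=... line.
# $1: the argument line; remaining arguments: the entity IDs to require.
missing_entities() {
    local line=$1 list entity
    shift
    # Keep only the text after "--trustmanager-entities=" and drop spaces.
    list=$(printf '%s' "${line#*--trustmanager-entities=}" | tr -d ' \t')
    for entity in "$@"; do
        case ",$list," in
            *",$entity,"*) ;;                    # present, nothing to report
            *) printf '%s\n' "$entity" ;;        # absent
        esac
    done
}

# Usage from the SSPI root shell (k is the shell alias used elsewhere in this article):
#   line=$(k get deployment authserver -n nsxi-platform -o yaml | grep -- '--trustmanager-entities=')
#   missing_entities "$line" NSX_UA_SVM NSX_UA_EDGE NSX_UA_TN
```

Empty output means all listed entity IDs are already configured.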

               

              Issue 2 - An Edge or ESXi node was added to NSX after the SSP deployment:

              Validate that the Edge/TN node certificate (/etc/vmware/nsx/host-cert.pem) is present in the SSP trust manager certificates. Once that is confirmed, restart the authserver pod so that it reloads the certificates.


              Issue 3 - The API certificate on the NSX Manager has changed after the SSP deployment:


              Check the certificate in use by the SHA agent on the NSX Manager:

              NSX version is equal to or higher than 4.2:
                  Get the root certificate and node certificate via command `/opt/vmware/nsx-netopa/bin/sha-appctl -c get_napp_certificates`.
                  Alternatively, get the root certificate and node certificate by searching syslog via command `zgrep nsx-sha /var/log/syslog* | grep "NAPP Profile" `.

              The certificate may not be found in the log if the log entries about the connection have been rotated. In this case, restart the SHA agent following the steps below.

              Get API certificate:

              Get the certificate ID from the NSX UI: 
              a. Login to the NSX Manager UI, navigate to System > Certificates
              b. Locate the API certificate for the node id used by the manager in question (find Manager node id by running get nodes in Manager CLI, or in the NSX UI at System → Appliances → NSX Manager → VIEW DETAILS → Copy the UUID)
              c. Expand this certificate item, note its UUID

              Use API GET /api/v1/trust-management/certificates/[certificate ID from last step] 

              curl -k -i -H "Accept: application/json" -u admin -X GET https://<Manager IP>/api/v1/trust-management/certificates/<certificate ID>
              {
                  "pem_encoded": "-----BEGIN CERTIFICATE-----XXXXXXXXXXXX-----END CERTIFICATE-----\n",
                  "has_private_key": true,
                  "used_by": [
                              {
                                  "node_id": "<UUID>",
                                  "service_types": [
                                      "API"
                                  ]
                              }],
                  "leaf_certificate_sha_256_thumbprint": "XX:XX:XX:XX:XX:XX",
                  "resource_type": "certificate_self_signed",
                  "id": "<UUID>",
                  "display_name": "API certificate for node <UUID>",
                  "_create_time": 1690371634010,
                  "_create_user": "system",
                  "_last_modified_time": 1690377045873,
                  "_last_modified_user": "admin",
                  "_system_owned": false,
                  "_protection": "NOT_PROTECTED",
                  "_revision": 2
              }

               

              If the node certificate is different from the updated API certificate, restart the SHA agent via command `service nsx-sha restart`.
              If restarting the SHA agent does not help, restart proton via command `service proton restart`.
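One way to compare the node certificate with the API certificate without reading PEM text by eye is to compare SHA-256 thumbprints. The following is a sketch under stated assumptions: the function names and the file name are hypothetical, the openssl command-line tool must be available on the machine where it runs, and it assumes that leaf_certificate_sha_256_thumbprint in the API response is the SHA-256 digest of the certificate, which is what openssl prints as the fingerprint.

```shell
# Hypothetical helpers; not part of NSX.

# Normalize a fingerprint string: drop any "...=" prefix that openssl prints,
# remove colons and whitespace, and lowercase the hex digits.
normalize_thumbprint() {
    printf '%s' "$1" | sed 's/^.*=//' | tr -d ': \t\r\n' | tr 'A-F' 'a-f'
}

# Print the normalized SHA-256 fingerprint of the first certificate in a PEM file.
cert_thumbprint() {
    normalize_thumbprint "$(openssl x509 -in "$1" -noout -fingerprint -sha256)"
}

# Usage (sha-node-cert.pem is an example name for the node certificate saved
# from the sha-appctl output; paste the thumbprint value from the API response):
#   api_tp=$(normalize_thumbprint '<leaf_certificate_sha_256_thumbprint value>')
#   [ "$(cert_thumbprint ./sha-node-cert.pem)" = "$api_tp" ] && echo "in sync" || echo "restart SHA agent"
```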

              Issue 4 - TN certificate has changed after the SSP deployment: 

              Check the node certificate in use by the SHA agent on the TN:
              NSX version is equal to or higher than 9.0:
              TN:
              /usr/lib/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp


              Edge:

              /opt/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp

              NSX version is lower than 9.0:

              Get the root certificate and node certificate by searching nsx-syslog:
              Copy all of the nsx-syslog files to a temporary directory and unzip the compressed logs.
              Search the logs for the NAPP Profile entries via command `grep -ia nsx-sha nsx-syslog* | grep -ia "NAPP Profile"`.

              The certificate may not be found in the log if the log entries about the connection have been rotated. In this case, restart the SHA agent following the steps below.

              Check the current node certificate on the TN:
              cat /etc/vmware/nsx/host-cert.pem
              If the node certificate used by SHA agent is different from the current effective certificate on the TN, restart the affected services:
              SHA agent - /etc/init.d/netopad restart
              Exporter - /etc/init.d/nsx-exporter restart

              The last remediation step is to restart proton on the common agent leader node, which forces a full sync. Allow a few minutes for the common agent to complete the full sync, then restart the authserver on the SSP side. This resyncs the certificates from the trust manager and should resolve the issue.

              Restart the proton service on NSX:
              Log on to the NSX Manager
              systemctl restart proton  

              Restart the authserver on the SSP side:
              Log into SSPI root shell
              k rollout restart deployment authserver

          • UNAVAILABLE or DEADLINE_EXCEEDED:
            • If there is an additional firewall between the TN (Edge or ESXi node) and the NAPP platform, check whether a policy denies traffic from the TN to NAPP; the traffic requires HTTPS (443). Refer to https://ports.broadcom.com/home/VmwarevDefend for details on the required open ports.
            • Log in to the NSX Manager managing this node and get the profile using API /api/v1/infra/sites/napp/registration.
            • Example:
              {
                  "napp_registration_results": [
                      {
                          "cluster_id": "4ad2272c-####-####-####-##########e1",
                          "cluster_name": "NSX Application Platform",
                          "message_bus_ip_address": "192.xx.xx.xx",
                          "ingress_ip_address": "napp.example.com",
                          "status": "DEPLOYMENT_SUCCESSFUL",
                          "is_intelligence_enabled": false,
                          "is_metric_enabled": true,
                          "resource_type": "NappRegistration",
                          "id": "4ad2272c-####-####-####-##########e1",
                          "display_name": "4ad2272c-####-####-####-##########e1",
                          "path": "/infra/settings/napp/napp-appliance-info/4ad2272c-####-####-####-##########e1",
                          "relative_path": "4ad2272c-####-####-####-##########e1",
                          "parent_path": "/infra",
                          "remote_path": "",
                          "unique_id": "4b5268b8-####-####-####-##########4e",
                          "realization_id": "4b5268b8-####-####-####-##########4e",
                          "owner_id": "2313c7d2-####-####-####-##########e5",
                          "marked_for_delete": false,
                          "overridden": false,
                          "_create_time": 1709187097362,
                          "_create_user": "nsx_policy",
                          "_last_modified_time": 1709187320744,
                          "_last_modified_user": "nsx_policy",
                          "_system_owned": false,
                          "_protection": "REQUIRE_OVERRIDE",
                          "_revision": 1
                      }
                  ]
              }

               

            • Check that the "ingress_ip_address" field in the API response is the same as the SSP FQDN.
            • Check and fix the connectivity from the "Reported by node" to the SSP FQDN.
            • On the "Reported by node", check that the DNS lookup result for "ingress_ip_address" in the API response is the real Security Services Platform (SSP) address.
            • If many of these alarms are raised from various Edge, ESX, and UA nodes, check whether there is an alarm for "nsx_application_platform_communication.manager_disconnected". Follow the remediation for that alarm first.
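The registration check for UNAVAILABLE or DEADLINE_EXCEEDED can be sketched as a small script. This is an illustration only: the function names and ssp.example.com are hypothetical placeholders, and the text-based field extraction assumes a response shaped like the example above.

```shell
# Hypothetical helpers; not part of NSX.

# Print the value of "ingress_ip_address" from the registration API response (stdin).
ingress_address() {
    sed -n 's/.*"ingress_ip_address": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed when the registered ingress address equals the expected SSP FQDN ($1),
# ignoring letter case. Reads the API response on stdin.
ingress_matches_fqdn() {
    local registered expected
    registered=$(ingress_address | tr 'A-Z' 'a-z')
    expected=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    [ -n "$registered" ] && [ "$registered" = "$expected" ]
}

# Usage against the NSX Manager (replace the placeholders):
#   curl -k -s -u admin "https://<NSX_MANAGER_IP>/api/v1/infra/sites/napp/registration" \
#       | ingress_matches_fqdn ssp.example.com && echo "registration matches SSP FQDN"
# Then, on the "Reported by node", check name resolution and reachability on port 443,
# for example with these tools if they are available on the node:
#   nslookup ssp.example.com
#   nc -z -w 5 ssp.example.com 443
```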
          • PERMISSION_DENIED:
                • Check the API response from envoy log:
                  • Get the projectcontour pod name and get envoy log from this pod using following commands 
                    • Log into SSPI root shell 
                    • k get pods -n projectcontour
                      • Example:
                         
                        root@nsx-mgr-1:~# napp-k get pods -n projectcontour
                        NAME                                      READY   STATUS    RESTARTS   AGE
                        projectcontour-contour-77669b45bd-lqdx4   1/1     Running   0          9d
                        projectcontour-envoy-8qjqz                2/2     Running   0          9d
                        projectcontour-envoy-dgqc6                2/2     Running   0          9d
                        projectcontour-envoy-pssjw                2/2     Running   0          9d
                        projectcontour-envoy-pt9fk                2/2     Running   0          9d
                        projectcontour-envoy-qn79h                2/2     Running   0          9d
                        projectcontour-envoy-wchcd                2/2     Running   0          9d
                         
                    • k logs {pod-name} -c envoy -n projectcontour 
                      • Example:root@nsx-mgr-1:~# k logs projectcontour-envoy-8qjqz -c envoy -n projectcontour

                  • This log shows the response flag of each API request.
                    • Example:

                      [2024-05-31T08:55:21.662Z] "POST /MetricsMgrGrpc/StatusMetricsHealthCheck HTTP/2" 200 UAEX 0 0 0 - "10.xx.xx.xx" "grpc-python/1.47.5 grpc-c/25.0.0 (linux; chttp2)" "57526c59-####-####-####-##########ae" "cloudnative.nsbucqesystem.net:443" "-"

                       

                      • UAEX stands for UnauthorizedExternalService.
                      • If the API response flag is UAEX, a possible reason is that the auth-server pod has stopped. To confirm, check the auth-server pod status using the following command:
                        • k get pods -n nsxi-platform | grep auth  
                        • Example: 
                          root@nsx-mgr-1:~# k get pods -n nsxi-platform | grep auth
                          authserver-6bc9b59ddd-sbc5g                                       1/1     Running     0               7d4h

                      • If the auth-server pod has stopped, reach out to support for further investigation.
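Envoy logs can be long, so filtering them for a given response flag saves time. The following is a minimal sketch, assuming access log lines in the format of the example above, where the status code and response flag directly follow the quoted request. The function name requests_with_flag is hypothetical.

```shell
# Hypothetical helper; reads envoy access log lines on stdin and prints
# "<status code> <request>" for each line whose response flag equals $1.
# Lines with several comma-separated flags are not matched by this exact comparison.
requests_with_flag() {
    awk -F'"' -v flag="$1" '
        NF >= 3 {
            # $2 is the quoted request; $3 holds " <code> <flags> ..." after it.
            split($3, f, " ")
            if (f[2] == flag) print f[1], $2
        }'
}

# Usage from the SSPI root shell (pod name taken from the example above):
#   k logs projectcontour-envoy-8qjqz -c envoy -n projectcontour | requests_with_flag UAEX
```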
      • If the above remediation does not resolve the alarm, restart the SHA agent on the "Reported by node":
        • If it is NSX Manager or NSX Edge: service nsx-sha restart
        • If it is ESXi: /etc/init.d/netopad restart