Metrics Delivery Failure alarm related to Security Services Platform (SSP) in NSX UI

Article ID: 389600

Products

  • VMware vDefend Firewall
  • VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Issue

The Metrics Delivery Failure alarm in the NSX UI indicates that metrics are failing to transmit from NSX components (Manager, Transport Nodes, or Edge Nodes) to the Security Services Platform (SSP).

Impact

  • Metrics are not available in SSP dashboards or APIs.
  • Visibility into health, performance, and usage is reduced.
  • Alarms may repeatedly trigger until sync is restored.

Maintenance Window

Not required. Most remediation steps are non-disruptive.
However, restarting key services (e.g. SHA, proton, netopad) may briefly impact metric collection.

Environment

  • vDefend SSP 5.0 or later

Cause

There are multiple possible root causes for this alarm. The most common are:

  • Certificate mismatch between NSX and SSP

  • Outdated/stale trust entities in authserver

  • API certificate on NSX Manager rotated but not refreshed in SSP

  • Transport Node (TN) or Edge certificate mismatch

  • Network/firewall or DNS issues preventing TN to SSP communication

Resolution

When a Metrics Delivery Failure alarm is raised for the NSX-to-SSP integration, the troubleshooting path depends on the status code included in the alarm.

1. Review the Alarm Details

  1. In the NSX UI, expand the Metrics Delivery Failure alarm.

  2. The alarm description includes a status code (e.g. UNAUTHENTICATED, UNAVAILABLE, PERMISSION_DENIED).

  3. This status code is critical — it determines the troubleshooting path.

  4. Note the affected node(s) and the status code.

The possible status codes are:

  • UNAUTHENTICATED – Certificate sync/authentication issue

  • UNAVAILABLE or DEADLINE_EXCEEDED – Network/DNS/firewall issue

  • PERMISSION_DENIED – Authorization failure on SSP side
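
The mapping above can be kept as a small lookup when triaging many alarms at once. The following is a minimal sketch; triage_path is a hypothetical helper name, not an NSX or SSP command, and the section numbers refer to this article.

```shell
# Map an alarm status code to the section of this article that covers it.
# triage_path is a hypothetical helper, not part of NSX or SSP.
triage_path() {
  case "$1" in
    UNAUTHENTICATED)
      echo "section 2: certificate sync/authentication" ;;
    UNAVAILABLE|DEADLINE_EXCEEDED)
      echo "section 3: network/DNS/firewall" ;;
    PERMISSION_DENIED)
      echo "section 4: authorization failure on SSP side" ;;
    *)
      echo "unlisted status code: see section 5" ;;
  esac
}
```

For example, triage_path DEADLINE_EXCEEDED points to the network/DNS/firewall checks in section 3.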

 

2. Status Code: UNAUTHENTICATED

This usually indicates a certificate synchronization issue between NSX and SSP.
The SSP cannot authenticate metrics sent by NSX nodes (Manager, Edge, or TN).

Common Causes

  • NSX API certificate was replaced but not updated in SSP truststore.

  • Transport Node (TN) or Edge node certificate changed after SSP deployment.

  • SSP Authserver pod is missing one or more NSX trust entities.

  • SHA agent running with stale certificates.

 

2.1 Quick Remediation – Reset Global Metrics Config

A stale metrics configuration can trigger this alarm. To refresh it:

  1. Retrieve current config:

    GET https://<NSX_MANAGER_IP>/policy/api/v1/infra/metric-global-config
    
    • Copy the full JSON response.

  2. Disable metrics temporarily:

    • Change "enabled": true to "enabled": false.

    PATCH https://<NSX_MANAGER_IP>/policy/api/v1/infra/metric-global-config
    
  3. Wait 1–2 minutes.

  4. Re-enable metrics:

    • Change "enabled": false to "enabled": true.

    • Send PATCH again.

This forces NSX Manager and SSP to refresh their metric delivery config.
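
The disable/re-enable cycle can be scripted with curl. The sketch below assumes NSX_MANAGER_IP, NSX_USER and NSX_PASS are environment variables you set yourself, that "enabled" appears once as a boolean in the response, and that curl trusts the NSX Manager certificate (otherwise supply --cacert). The function names are illustrative only.

```shell
# Flip the "enabled" flag in the JSON read from stdin; $1 is true or false.
# Assumes "enabled" appears once, as a boolean, in the config.
set_enabled() {
  sed -E "s/\"enabled\"[[:space:]]*:[[:space:]]*(true|false)/\"enabled\": $1/"
}

# GET the config, PATCH it back disabled, wait, then PATCH it back enabled.
# NSX_MANAGER_IP, NSX_USER and NSX_PASS are placeholder environment variables.
reset_metrics_config() {
  url="https://${NSX_MANAGER_IP}/policy/api/v1/infra/metric-global-config"
  cfg=$(curl -sS -u "${NSX_USER}:${NSX_PASS}" "$url") || return 1
  for state in false true; do
    printf '%s' "$cfg" | set_enabled "$state" |
      curl -sS -u "${NSX_USER}:${NSX_PASS}" -X PATCH \
           -H 'Content-Type: application/json' --data-binary @- "$url" || return 1
    # Wait 1-2 minutes after disabling, before re-enabling.
    if [ "$state" = false ]; then sleep 120; fi
  done
}
```

Only set_enabled is pure text processing and can be tried offline, for example: echo '{"enabled": true}' | set_enabled false.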

If the alarm persists, continue with the steps below.

 

2.2 Gather Required Data

On impacted TN / Edge Node:

  • Get Node UUID:

    /bin/nsxcli -c get node-uuid
    
  • Save Node Certificate:

    cat /etc/vmware/nsx/host-cert.pem
    

On NSX Manager:

  • Get messaging client certificates:

    GET https://<NSX_MANAGER_IP>/api/v1/messaging/clients
    
    • Match client_id with the Node UUID noted earlier.

    • Check if the certificate matches host-cert.pem.

On SSP UI:

  • Navigate to System → Certificates.

  • Locate certificate named:

    • NSX_UA_TN <NODE_UUID> or

    • NSX_UA_EDGE <NODE_UUID>.

  • Export this certificate.

  • Compare with host-cert.pem.

On SSP Installer (Authserver validation):

  • List authserver pod:

    k get pods -n nsxi-platform | grep authserver
    
  • Restart it:

    k delete pod -n nsxi-platform <authserver-pod-name>
    
  • After restart, check logs for cert sync:

    k logs <authserver-pod-name> -n nsxi-platform | grep "NSX_UA_TN"
    k logs <authserver-pod-name> -n nsxi-platform | grep "NSX_UA_EDGE"
    
  • To view full certificate in logs:

    k logs <authserver-pod-name> -n nsxi-platform | grep "NSX_UA_TN" | grep "<first-few-characters-of-cert>" -A 50
    

Validate if the certificate matches the one saved earlier.
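
Comparing certificates by eye is error-prone because the same certificate can be wrapped at different line lengths or saved with different line endings. The sketch below compares two PEM files by their base64 body only; pem_body and same_cert are hypothetical helper names, and the approach assumes each file's first certificate is the one of interest.

```shell
# Print the base64 body of the first certificate in PEM file $1,
# with header, footer and all whitespace removed.
pem_body() {
  awk '/-----BEGIN CERTIFICATE-----/ { on = 1; next }
       /-----END CERTIFICATE-----/   { exit }
       on' "$1" | tr -d ' \t\r\n'
}

# Succeed only if both files contain a certificate and the bodies are identical.
same_cert() {
  a=$(pem_body "$1")
  b=$(pem_body "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A possible use: save the certificate exported from SSP as ssp-export.pem (an example file name) and run same_cert /etc/vmware/nsx/host-cert.pem ssp-export.pem && echo match. Where openssl is available, openssl x509 -noout -fingerprint -sha256 -in <file> on each file gives an equivalent comparison.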

 

2.3 Common Issues

Issue 1 – Missing Trust Entity (NSX_UA_TN)

  • If trust entity is missing in authserver config:

    k edit deployment authserver -n nsxi-platform
    
  • Find line:

    --trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE
    
  • Add missing entity:

    --trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN
    
  • Authserver will restart and sync certs.
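
Before editing the deployment, it can help to confirm which entities are actually missing from the flag. This is a minimal sketch; missing_entities is a hypothetical helper that assumes the flag and its comma-separated list sit on a single line, as in the example above.

```shell
# Print each entity named in $2.. that is absent from the flag line in $1.
# Assumes the whole --trustmanager-entities list is on one line.
missing_entities() {
  line=$1
  shift
  have=$(printf '%s' "$line" |
    sed 's/.*--trustmanager-entities=//' | tr -d ' "' | tr ',' '\n')
  for want in "$@"; do
    printf '%s\n' "$have" | grep -qxF "$want" || echo "$want"
  done
}
```

One way to feed it is with the matching line from k get deployment authserver -n nsxi-platform -o yaml | grep trustmanager-entities, followed by the entities you expect, for example NSX_UA_SVM NSX_UA_EDGE NSX_UA_TN. Empty output means nothing is missing.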

 

Issue 2 – New Edge or ESXi Added After SSP Deployment

  • Verify that the new node’s certificate (/etc/vmware/nsx/host-cert.pem) exists in the SSP trust manager.

  • Restart authserver pod to refresh:

    k get pods -n nsxi-platform | grep authserver
    k delete pod -n nsxi-platform <authserver-pod-name>

 

Issue 3 – NSX Manager API certificate changed after SSP deployment

Step 1: Check which certificate SHA agent is using on the host (from NSX Manager side)

  • On NSX 4.2+, you can directly ask the SHA process about its certificates:

    /opt/vmware/nsx-netopa/bin/sha-appctl -c get_napp_certificates
    

    This shows the root certificate and node certificate that SHA is currently using.

  • Alternatively, if you can’t run that, you can check syslog for SHA startup messages:

    zgrep nsx-sha /var/log/syslog* | grep "NAPP Profile"
    

    Those lines show which certificate/profile SHA used when connecting to SSP.
    If the logs have been rotated, these lines may no longer be present.

Step 2: Get the API certificate from NSX Manager

  • Every NSX Manager node has its own API certificate (used for management/API communication).

  • For checking it:

    • Log in to NSX Manager UI → System > Certificates.

    • Find the API certificate that belongs to your Manager node (you identify the right Manager node by UUID).

    • Copy that cert’s UUID.

  • Then query it via API:

    GET /api/v1/trust-management/certificates/<certificate ID>
    

    This returns:

    • Full cert (pem_encoded)

    • Thumbprint (leaf_certificate_sha_256_thumbprint)

    • Who is using it (used_by section → service_types: "API")

Compare the node certificate in use by SHA agent (from step 1) vs. the API certificate currently installed on NSX Manager (from step 2).

If the NSX Manager’s API certificate was recently rotated or replaced, SHA may still hold the old certificate, in which case the SHA agent cannot authenticate to NSX Manager/SSP correctly.

  • If the SHA cert in use ≠ the current API cert:

    • Restart SHA agent so it re-fetches the updated certificate:

      service nsx-sha restart
      
  • If that doesn’t help (SHA is still stuck), restart proton:

    service proton restart
    

    proton is the higher-level security framework service that manages SHA and related processes — restarting it forces a re-registration of trust.

 

Issue 4 – Transport Node (TN) certificate changed after SSP deployment

  • Each Transport Node (ESXi/Edge) has its own node certificate.

  • The SHA agent uses that cert to prove its identity to SSP.

  • If the TN’s cert changes (for example after a rotation), the SHA agent might still be trying to use the old cert, which no longer matches → authentication fails.

Step 1 – Check which certificate SHA agent is actually using

  • On NSX 9.0 or higher:

    • On a Transport Node (ESXi host):

      /usr/lib/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp
      
    • On an Edge node:

      /opt/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp
      
  • On NSX below 9.0:

    • Search the nsx-syslog logs instead:

      grep -ia nsx-sha nsx-syslog* | grep -ia "NAPP Profile"
      
    • (You may need to unzip archived nsx-syslog bundles first.)

    • If logs are rotated, you may not find it.

Step 2 – Check what the current node certificate really is

cat /etc/vmware/nsx/host-cert.pem

This shows the effective Transport Node certificate.

Step 3 – Compare

  • If the SHA agent is using a different cert than the current TN cert → mismatch detected.

Step 4 – Fix the mismatch

Restart services so SHA re-reads the correct certificate:

  1. Restart SHA agent:

    /etc/init.d/netopad restart
    
  2. Restart exporter:

    /etc/init.d/nsx-exporter restart
    

Step 5 – If still not fixed

Do a full sync of trust between NSX and SSP:

  1. Restart proton (leader/common agent on NSX Manager):

    systemctl restart proton
    

    → This forces NSX Manager to re-sync all certs to transport nodes.
    → Wait a few minutes for the sync to complete.

  2. Restart authserver on SSP side:

    kubectl rollout restart deployment authserver -n nsxi-platform
    

    → This makes SSP reload the updated certificate from trust manager.

 

3. Status Code: UNAVAILABLE / DEADLINE_EXCEEDED

  1. Check network/firewall

    • Make sure there’s no firewall blocking traffic from TN → SSP FQDN on TCP 443.

    • Reference required ports: Broadcom Ports Guide.

  2. Check SSP registration info (from NSX Manager API):

    GET /api/v1/infra/sites/napp/registration
    

    Look at the response:

    • ingress_ip_address → should match your SSP FQDN.

    • Confirm this matches what nodes are actually using.

  3. Validate DNS resolution
    On the reported TN, ensure DNS resolves the ingress_ip_address (FQDN) to the correct SSP address.

  4. If many nodes are impacted
    Check for a manager disconnection alarm:
    nsx_application_platform_communication.manager_disconnected
    → Fix that first, because it breaks communication for all TNs.
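
The DNS and port checks above can be sketched as two small helpers. Both names are hypothetical. resolves_to expects resolver output in the "ADDRESS NAME" format printed by getent hosts, which may not exist on every node type, and port_open assumes an nc build that supports -z and -w.

```shell
# Succeed if any line of resolver output on stdin starts with address $1.
# Expects "ADDRESS NAME" lines, the format printed by `getent hosts`.
resolves_to() {
  awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Succeed if a TCP connection to host $1 on port $2 (default 443) opens
# within 5 seconds. Assumes an nc build that supports -z and -w.
port_open() {
  nc -z -w 5 "$1" "${2:-443}"
}
```

On a node where getent is available, getent hosts <SSP_FQDN> | resolves_to <expected SSP address> checks resolution, and port_open <SSP_FQDN> checks that TCP 443 is reachable. Where getent is missing, the resolver check needs to be adapted to the output of whatever lookup tool the node provides.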

 

4. Status Code: PERMISSION_DENIED

  1. Check envoy logs (SSP ingress proxy)

    • Get envoy pod name:

      k get pods -n projectcontour
      
    • View envoy logs:

      k logs <pod-name> -c envoy -n projectcontour
      
  2. Look for API response flags
    Example log:

    "POST /MetricsMgrGrpc/StatusMetricsHealthCheck HTTP/2" 200 UAEX ...
    • UAEX = UnauthorizedExternalService
      → usually means the auth-server pod is down.

  3. Check auth-server pod status

    k get pods -n nsxi-platform | grep auth
    
    • If it’s not running → contact support for deeper investigation.
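
Envoy logs can be long, so filtering for the UAEX flag narrows them quickly. The sketch below only assumes that the flag appears as a separate word in each access-log line, as in the example log above; uaex_lines and count_uaex are hypothetical helper names.

```shell
# Print only the access-log lines that carry the UAEX response flag.
uaex_lines() {
  grep -w 'UAEX'
}

# Print how many access-log lines on stdin carry the UAEX response flag.
count_uaex() {
  grep -cw 'UAEX'
}
```

For example, k logs <pod-name> -c envoy -n projectcontour | count_uaex prints the number of rejected requests, and piping to uaex_lines instead shows which requests were affected.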

 

5. General Fix – Restart SHA Agent

If the checks above do not resolve the alarm:

  • On NSX Manager / Edge:

    service nsx-sha restart
    
  • On ESXi host:

    /etc/init.d/netopad restart