Metrics Delivery Failure alarm related to Security Services Platform (SSP) in NSX UI

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Issue

The Metrics Delivery Failure alarm in the NSX UI indicates that metrics are failing to transmit from NSX components (Manager, Transport Nodes, or Edge Nodes) to the Security Services Platform (SSP).

Impact

Metrics are not available in SSP dashboards or APIs.
Visibility into health, performance, and usage is reduced.
Alarms may repeatedly trigger until sync is restored.

Maintenance Window

Not required. Most remediation steps are non-disruptive.
However, restarting key services (e.g. SHA, proton, netopad) may briefly impact metric collection.

Environment

vDefend SSP 5.0 or later

Cause

There are multiple possible root causes for this alarm. The most common are:

Certificate mismatch between NSX and SSP
Outdated/stale trust entities in authserver
API certificate on NSX Manager rotated but not refreshed in SSP
Transport Node (TN) or Edge certificate mismatch
Network/firewall or DNS issues preventing TN to SSP communication

Resolution

When a Metrics Delivery Failure alarm is raised for the NSX-to-SSP integration, the troubleshooting path depends on the status code included in the alarm.

1. Review the Alarm Details

In the NSX UI, expand the Metrics Delivery Failure alarm.
The alarm description includes a status code (e.g. UNAUTHENTICATED, UNAVAILABLE, PERMISSION_DENIED).
This status code is critical because it determines the troubleshooting path.
Note the affected node(s) and the status code.

The possible status codes are:

UNAUTHENTICATED – Certificate sync/authentication issue
UNAVAILABLE or DEADLINE_EXCEEDED – Network/DNS/firewall issue
PERMISSION_DENIED – Authorization failure on SSP side
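As a quick triage aid, the mapping above can be expressed as a lookup. This is only a sketch: the exact wording of the alarm description is an assumption, so the helper simply searches the text for one of the three documented status codes.

```python
# Status codes and their troubleshooting sections, as listed in this article.
TROUBLESHOOTING_PATHS = {
    "UNAUTHENTICATED": "Section 2: certificate sync/authentication",
    "UNAVAILABLE": "Section 3: network/DNS/firewall",
    "DEADLINE_EXCEEDED": "Section 3: network/DNS/firewall",
    "PERMISSION_DENIED": "Section 4: authorization failure on SSP side",
}


def troubleshooting_path(alarm_description: str) -> str:
    """Return the section to follow for the first known status code found.

    The alarm text format is an assumption; only the status code names
    come from the article.
    """
    text = alarm_description.upper()
    for code, path in TROUBLESHOOTING_PATHS.items():
        if code in text:
            return path
    return "Unknown status code: review the alarm details manually"
```

For example, `troubleshooting_path("... status UNAUTHENTICATED ...")` points to Section 2.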

2. Status Code: UNAUTHENTICATED

This usually indicates a certificate synchronization issue between NSX and SSP.
The SSP cannot authenticate metrics sent by NSX nodes (Manager, Edge, or TN).

Common Causes

NSX API certificate was replaced but not updated in SSP truststore.
Transport Node (TN) or Edge node certificate changed after SSP deployment.
SSP Authserver pod is missing one or more NSX trust entities.
SHA agent running with stale certificates.

2.1 Quick Remediation – Reset Global Metrics Config

A stale metrics configuration can cause this alarm. To refresh it:

Retrieve current config:
GET https://<NSX_MANAGER_IP>/policy/api/v1/infra/metric-global-config
Copy the full JSON response.
Disable metrics temporarily:
In the copied JSON, change "enabled": true to "enabled": false, then send it:
PATCH https://<NSX_MANAGER_IP>/policy/api/v1/infra/metric-global-config
Wait 1–2 minutes.

Re-enable metrics:
Change "enabled": false back to "enabled": true and send the PATCH again.

This forces NSX Manager and SSP to refresh their metric delivery config.

If the alarm persists, continue below.
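The disable/re-enable cycle in 2.1 amounts to sending the same JSON body twice with only the "enabled" flag changed. The sketch below prepares both PATCH bodies from a saved GET response. It does not call NSX; the sample `current` dictionary is hypothetical, and only the "enabled" key is taken from the steps above.

```python
import copy
import json


def with_enabled(config: dict, enabled: bool) -> dict:
    """Return a copy of the metric-global-config body with "enabled" set."""
    updated = copy.deepcopy(config)
    updated["enabled"] = enabled
    return updated


# Hypothetical saved GET response; a real response has more fields,
# which are carried over unchanged.
current = {"enabled": True, "example_other_field": "kept as-is"}

disable_body = json.dumps(with_enabled(current, False))  # first PATCH
enable_body = json.dumps(with_enabled(current, True))    # second PATCH, 1-2 minutes later
```

Working from a copy keeps the original response intact, so the re-enable body is guaranteed to match what NSX returned apart from the flag.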

2.2 Gather Required Data

On impacted TN / Edge Node:

Get Node UUID:
/bin/nsxcli -c get node-uuid
Save Node Certificate:
cat /etc/vmware/nsx/host-cert.pem

On NSX Manager:

Get messaging client certificates:
GET https://<NSX_MANAGER_IP>/api/v1/messaging/clients
Match client_id with the Node UUID noted earlier.

Check if the certificate matches host-cert.pem.
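When the messaging clients response is large, the entry for one node can be picked out programmatically. This is a minimal sketch: the `client_id` field is named in the step above, but the top-level `results` list is an assumption about the response shape, and the sample data is made up.

```python
def find_client(clients_response: dict, node_uuid: str):
    """Return the messaging client entry whose client_id equals node_uuid.

    Assumes the entries sit in a top-level "results" list (not confirmed
    by this article). Returns None when no entry matches.
    """
    wanted = node_uuid.strip().lower()
    for entry in clients_response.get("results", []):
        if str(entry.get("client_id", "")).strip().lower() == wanted:
            return entry
    return None


# Hypothetical sample response for illustration only.
sample = {
    "results": [
        {"client_id": "11111111-aaaa-bbbb-cccc-000000000001"},
        {"client_id": "22222222-aaaa-bbbb-cccc-000000000002"},
    ]
}
match = find_client(sample, "22222222-AAAA-BBBB-CCCC-000000000002\n")
```

The comparison is case-insensitive and ignores surrounding whitespace, which helps when the UUID is pasted from `get node-uuid` output.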

On SSP UI:

Navigate to System → Certificates.
Locate certificate named:
- NSX_UA_TN <NODE_UUID> or
- NSX_UA_EDGE <NODE_UUID>.
Export this certificate.
Compare with host-cert.pem.
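Comparing PEM files by eye is error-prone because line wrapping and line endings can differ between exports. One way to compare them, sketched here with the Python standard library only, is to decode each PEM block to its DER bytes and compare SHA-256 digests. The function names are illustrative, not part of NSX or SSP.

```python
import base64
import hashlib
import re

PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def pem_fingerprint(pem_text: str) -> str:
    """SHA-256 hex digest of the DER bytes of the first certificate in pem_text."""
    match = PEM_BLOCK.search(pem_text)
    if match is None:
        raise ValueError("no PEM certificate block found")
    # Drop all whitespace so wrapping and CRLF differences do not matter.
    der = base64.b64decode("".join(match.group(1).split()))
    return hashlib.sha256(der).hexdigest()


def same_certificate(pem_a: str, pem_b: str) -> bool:
    """True when both PEM texts carry the same first certificate."""
    return pem_fingerprint(pem_a) == pem_fingerprint(pem_b)
```

Typical use would be `same_certificate(open("host-cert.pem").read(), open("ssp-export.pem").read())`, where both file names are placeholders for the files saved in the steps above.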

On SSP Installer (Authserver validation):

List the authserver pod (k is an alias for kubectl):

k get pods -n nsxi-platform | grep authserver

Restart it:

k delete pod -n nsxi-platform <authserver-pod-name>

After restart, check logs for cert sync:

k logs authserver-<podname> -n nsxi-platform | grep "NSX_UA_TN"
k logs authserver-<podname> -n nsxi-platform | grep "NSX_UA_EDGE"

To view full certificate in logs:

k logs authserver-<podname> -n nsxi-platform | grep "NSX_UA_TN" | grep "<first-few-characters-of-cert>" -A 50

Validate if the certificate matches the one saved earlier.

2.3 Common Issues

Issue 1 – Missing Trust Entity (`NSX_UA_TN`)

If trust entity is missing in authserver config:
k edit deployment authserver -n nsxi-platform
Find line:
--trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE
Add missing entity:
--trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN
Authserver will restart and sync certs.
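The edit in Issue 1 is a small list manipulation on one flag line. The sketch below shows the intended before/after transformation so the result can be checked before saving the deployment. It works on a copy of the line as text and does not touch the cluster; the helper name is illustrative.

```python
FLAG = "--trustmanager-entities="


def add_trust_entity(flag_line: str, entity: str) -> str:
    """Return flag_line with entity appended if it is not already listed."""
    stripped = flag_line.strip()
    if not stripped.startswith(FLAG):
        raise ValueError("not a --trustmanager-entities line")
    entities = [e.strip() for e in stripped[len(FLAG):].split(",") if e.strip()]
    if entity not in entities:
        entities.append(entity)
    return FLAG + ", ".join(entities)


before = "--trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE"
after = add_trust_entity(before, "NSX_UA_TN")
```

Running the helper a second time on `after` leaves the line unchanged, so it is safe to apply when unsure whether the entity is already present.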

Issue 2 – New Edge or ESXi Added After SSP Deployment

Validate that the new node’s certificate (/etc/vmware/nsx/host-cert.pem) exists in the SSP trust manager.

Restart authserver pod to refresh:

k get pods -n nsxi-platform | grep authserver
k delete pod -n nsxi-platform <authserver-pod-name>

Issue 3 – API certificate on NSX Manager changed after SSP deployment

Step 1: Check which certificate the SHA agent is using (run on NSX Manager)

On NSX 4.2+, query the SHA process directly for its certificates:
/opt/vmware/nsx-netopa/bin/sha-appctl -c get_napp_certificates
This shows the root certificate and node certificate that SHA is currently using.
If that command is not available, check syslog for SHA startup messages instead:
zgrep nsx-sha /var/log/syslog* | grep "NAPP Profile"
These lines show which certificate/profile SHA used when connecting to SSP.
If the logs have rotated, these lines may no longer be present.

Step 2: Get the API certificate from NSX Manager

Every NSX Manager node has its own API certificate (used for management/API communication).

To check it:

Log in to the NSX Manager UI and go to System > Certificates.

Find the API certificate that belongs to your Manager node (identify the correct Manager node by its UUID).

Copy that certificate’s UUID (the certificate ID).
Then query it via API:
GET https://<NSX_MANAGER_IP>/api/v1/trust-management/certificates/<certificate ID>
This returns:

Full cert (pem_encoded)

Thumbprint (leaf_certificate_sha_256_thumbprint)

Who is using it (used_by section → service_types: "API")
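The three fields listed above can be pulled out of a saved response for easier comparison. This is a sketch under assumptions: the field names `pem_encoded`, `leaf_certificate_sha_256_thumbprint`, `used_by`, and `service_types` come from the list above, but treating `used_by` as a list of objects that each carry a `service_types` list is a guess at the exact nesting, and the sample values are made up.

```python
def summarize_certificate(cert: dict) -> dict:
    """Pick out the fields named above from a certificate API response.

    Assumes "used_by" is a list of objects, each with a "service_types"
    list (nesting not confirmed by this article).
    """
    service_types = sorted(
        {s for user in cert.get("used_by", []) for s in user.get("service_types", [])}
    )
    return {
        "thumbprint": cert.get("leaf_certificate_sha_256_thumbprint"),
        "pem": cert.get("pem_encoded"),
        "service_types": service_types,
        "is_api_certificate": "API" in service_types,
    }


# Hypothetical sample response for illustration only.
sample_cert = {
    "pem_encoded": "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n",
    "leaf_certificate_sha_256_thumbprint": "ab12",
    "used_by": [{"service_types": ["API"]}],
}
summary = summarize_certificate(sample_cert)
```

If `is_api_certificate` is false, the certificate queried is probably not the one the Manager node currently serves for API traffic.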

Compare the node certificate in use by SHA agent (from step 1) vs. the API certificate currently installed on NSX Manager (from step 2).

If the NSX Manager’s API certificate was recently rotated or replaced, SHA might still be holding the old certificate. In that case the SHA agent cannot authenticate to NSX Manager/SSP correctly.

If the certificate in use by SHA differs from the current API certificate:
Restart the SHA agent so it re-fetches the updated certificate:
service nsx-sha restart
If that doesn’t help (SHA is still stuck), restart proton:
service proton restart
proton is the higher-level security framework service that manages SHA and related processes; restarting it forces a re-registration of trust.

Issue 4 – Transport Node (TN) certificate changed after SSP deployment

Each Transport Node (ESXi/Edge) has its own node certificate.
The SHA agent uses that cert to prove its identity to SSP.
If the TN’s certificate changes (for example after a rotation), the SHA agent might still be trying to use the old certificate, which no longer matches, so authentication fails.

Step 1 – Check which certificate SHA agent is actually using

On NSX 9.0 or higher:
On a Transport Node (ESXi host):
/usr/lib/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp
On an Edge node:
/opt/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp
On NSX below 9.0:
Search the nsx-syslog logs instead:
grep -ia nsx-sha nsx-syslog* | grep -ia "NAPP Profile"
(You may need to unzip archived nsx-syslog bundles first.)

If the logs have rotated, the entry may no longer be present.

Step 2 – Check what the current node certificate really is

On the node, read the node certificate file, as in section 2.2:
cat /etc/vmware/nsx/host-cert.pem
This shows the effective Transport Node certificate.

Step 3 – Compare

If the SHA agent is using a different certificate than the current TN certificate, there is a mismatch.

Step 4 – Fix the mismatch

Restart services so SHA re-reads the correct certificate:

Restart SHA agent:
/etc/init.d/netopad restart
Restart exporter:
/etc/init.d/nsx-exporter restart

Step 5 – If still not fixed

Do a full sync of trust between NSX and SSP:

Restart proton (leader/common agent on NSX Manager):
systemctl restart proton
→ This forces NSX Manager to re-sync all certs to transport nodes.
→ Wait a few minutes for the sync to complete.
Restart authserver on SSP side:
kubectl rollout restart deployment authserver -n nsxi-platform
→ This makes SSP reload the updated certificate from trust manager.

3. Status Code: UNAVAILABLE / DEADLINE_EXCEEDED

Check network/firewall

Make sure there’s no firewall blocking traffic from TN → SSP FQDN on TCP 443.

Reference required ports: Broadcom Ports Guide.
Check SSP registration info (from NSX Manager API):
GET /api/v1/infra/sites/napp/registration
Look at the response:

ingress_ip_address → should match your SSP FQDN.

Confirm this matches what nodes are actually using.
Validate DNS resolution
On the reported TN, ensure DNS resolves the ingress_ip_address (FQDN) to the correct SSP address.
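The DNS check can be scripted when many nodes need to be verified. The sketch below uses Python's standard `socket.getaddrinfo` and must run on a machine that uses the same DNS servers as the affected TN, otherwise the result says nothing about what the node sees. The FQDN and IP values in the usage note are placeholders.

```python
import socket


def resolved_addresses(fqdn: str, port: int = 443, resolver=socket.getaddrinfo) -> set:
    """Return the set of IP addresses that fqdn resolves to.

    The resolver parameter exists only so the lookup can be replaced in
    tests; by default the system resolver is used.
    """
    infos = resolver(fqdn, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def resolves_to(fqdn: str, expected_ip: str, **kwargs) -> bool:
    """True when expected_ip is among the addresses fqdn resolves to."""
    try:
        return expected_ip in resolved_addresses(fqdn, **kwargs)
    except socket.gaierror:
        # Name did not resolve at all.
        return False
```

A call such as `resolves_to("<SSP_FQDN>", "<SSP_INGRESS_IP>")` returning False points to a DNS problem rather than a firewall problem.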

If many nodes are impacted
Check for a manager disconnection alarm:
nsx_application_platform_communication.manager_disconnected
→ Fix that first, because it breaks communication for all TNs.

4. Status Code: PERMISSION_DENIED

Check envoy logs (SSP ingress proxy)
Get envoy pod name:
k get pods -n projectcontour
View envoy logs:
k logs <pod-name> -c envoy -n projectcontour
Look for API response flags
Example log:

"POST /MetricsMgrGrpc/StatusMetricsHealthCheck HTTP/2" 200 UAEX ...

UAEX = UnauthorizedExternalService
→ usually means the auth-server pod is down.
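Envoy logs can be long, so it can help to filter for the response flag mechanically. The sketch below assumes the layout shown in the example line above, where the quoted request is followed by the status code and then the response flags; other Envoy log formats would need a different pattern.

```python
import re

# Quoted request, three-digit status, then the response-flags token.
ACCESS_LINE = re.compile(r'"(?P<request>[^"]*)"\s+(?P<status>\d{3})\s+(?P<flags>\S+)')


def response_flags(log_line: str) -> list:
    """Return the response flags found in an access-log line, or [] if none."""
    match = ACCESS_LINE.search(log_line)
    if match is None or match.group("flags") == "-":
        return []
    return match.group("flags").split(",")


def has_uaex(log_line: str) -> bool:
    """True when the line carries the UAEX (UnauthorizedExternalService) flag."""
    return "UAEX" in response_flags(log_line)


example = '"POST /MetricsMgrGrpc/StatusMetricsHealthCheck HTTP/2" 200 UAEX ...'
```

Applied to the saved output of the `k logs` command above, `has_uaex` can be used to keep only the lines worth inspecting.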
Check auth-server pod status
k get pods -n nsxi-platform | grep auth
If it is not running, contact support for deeper investigation.

5. General Fix – Restart SHA Agent

If the checks above do not resolve the alarm:

On NSX Manager / Edge:
service nsx-sha restart
On ESXi host:
/etc/init.d/netopad restart