Purpose: Indicates continuous transmission failure of metrics to Security Services Platform (SSP)
Impact: Dashboard and API will not show the metrics from these nodes
Maintenance window required for remediation? No
vDefend SSP >= 5.0
There is a problem delivering metrics to SSP
Security Services Platform (SSP) should be deployed
- UNAUTHENTICATED
Steps to validate that the certificates are in sync between SSP, the TN/Edge node, and NSX:
SSP:
Check the certificates added to authserver from trust manager
Log into the SSPI root shell
k get pods -n nsxi-platform | grep authserver
k logs authserver-<podname> -n nsxi-platform | grep "NSX_UA_TN"
k logs authserver-<podname> -n nsxi-platform | grep "NSX_UA_EDGE"
TN / Edge node:
Use the following command on the TN/Edge node to print the host certificate, which must match the certificate held by the SSP trust manager.
cat /etc/vmware/nsx/host-cert.pem
Get the UUID of the TN/Edge node with the CLI command get node-uuid. This UUID should match the alias of the TN/Edge node in the trust-manager get certs API response.
Example:
On TN node:
>> /bin/nsxcli -c get node-uuid
On Edge node:
>> su admin
>> get node-uuid
Trust-manager:
The previous step gives the UUID of the TN/Edge node.
In the SSP UI, navigate to System → Certificates and find the entry whose Name column contains that UUID. Names are in the format NSX_UA_TN <NODE_UUID> or NSX_UA_EDGE <NODE_UUID>.
Export the corresponding certificate and verify that it matches the host-cert.pem certificate retrieved from the TN/Edge node in the previous step.
For example, the following command shows the certificates used by the SHA agent:
TN:
/usr/lib/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp
{"type": "napp", "status": {"10.xx.xx.xx:443": {"UPM Profile": "Received", "Global Config": "Enabled", "Metric Stub": "Created", "ingress_certificate": "-----BEGIN CERTIFICATE-----......-----END CERTIFICATE-----\n", "certificate_chain": "-----BEGIN CERTIFICATE-----......-----END CERTIFICATE-----\n"}}}
Edge:
/opt/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp
{"type": "napp", "status": {"10.xx.xx.xx:443": {"UPM Profile": "Received", "Global Config": "Enabled", "Metric Stub": "Created", "ingress_certificate": "-----BEGIN CERTIFICATE-----......-----END CERTIFICATE-----\n", "certificate_chain": "-----BEGIN CERTIFICATE-----......-----END CERTIFICATE-----\n"}}}
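The comparison between the certificate exported from the SSP UI and host-cert.pem can be sketched as a small shell helper. This is a minimal sketch, assuming both certificates have been saved as PEM files on a Linux host with bash; the function names pem_body and same_cert are illustrative and are not part of any NSX or SSP tooling. It compares only the base64 body, so differences in line wrapping or line endings should not produce a false mismatch.

```shell
#!/usr/bin/env bash
# pem_body: print the base64 body of the first certificate in a PEM file,
# with the BEGIN/END lines and all whitespace removed.
pem_body() {
  awk '/-----BEGIN CERTIFICATE-----/ {inside=1; next}
       /-----END CERTIFICATE-----/   {exit}
       inside {print}' "$1" | tr -d ' \t\r\n'
}

# same_cert: succeed (exit 0) when both PEM files hold the same certificate.
# An empty body on either side is treated as a mismatch.
same_cert() {
  local a b
  a=$(pem_body "$1")
  b=$(pem_body "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A possible invocation, with hypothetical file names: `same_cert host-cert.pem exported-from-ssp.pem && echo match || echo MISMATCH`.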
NSX manager:
Query the TN certificates present in NSX Manager with the following API:
GET "https://<NSX_MANAGER_IP>/api/v1/messaging/clients"
Common agent is the service that pushes the certificates into trust manager on the SSP side. To check whether common agent has synced properly, review /var/log/proton/nsxapi.log around the time the TN/Edge node was added.
Steps to identify the leader node of the common agent service:
Determine which of the 3 manager nodes holds the common agent leadership role. The following command shows the common agent leader by node ID.
1. su admin -c get clus stat verb | grep "COMMON_AGENT_SERVICE"
2. To map the node ID to a manager node: su admin -c get clus stat
Issue 1 - NSX_UA_TN is missing in authserver:
On SSP side:
Log into the SSPI root shell
k edit deployment authserver -n nsxi-platform
Then search for the below line:
--trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE
Edit the line and add the missing entity ID:
--trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN
After this change, authserver should restart and sync the entity certificates from trust manager, which should resolve the authentication issue on the SSP side. To validate, check that the certificates in trust manager and authserver are in sync.
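Before or after editing the deployment, the entity list can be checked without opening an editor. The following is a minimal sketch, assuming bash and that the deployment text contains the --trustmanager-entities argument on a single line as shown above; missing_entities is a hypothetical helper name, and the expected entity IDs are the three from the corrected line.

```shell
#!/usr/bin/env bash
# missing_entities: read deployment text on stdin and print each expected
# entity ID that is absent from the --trustmanager-entities argument.
# Empty output means all three entity IDs are present.
missing_entities() {
  local line entity
  line=$(grep -- '--trustmanager-entities=' | head -n 1 || true)
  for entity in NSX_UA_SVM NSX_UA_EDGE NSX_UA_TN; do
    # Split the value on commas, spaces and quotes, then match whole tokens.
    if ! printf '%s\n' "${line#*=}" | tr ', "' '\n\n\n' | grep -qx -- "$entity"; then
      printf '%s\n' "$entity"
    fi
  done
}
```

It could be fed from the deployment definition, for example `k get deployment authserver -n nsxi-platform -o yaml | missing_entities`.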
Issue 2 - For every new Edge or ESXi node added to NSX after SSP deployment:
Validate that the Edge/TN node certificate (/etc/vmware/nsx/host-cert.pem) is present among the SSP trust manager certificates. Once that is confirmed, restart the authserver pod so that authserver picks up the certificates.
Issue 3 - The API certificate on NSX Manager has changed after the SSP deployment:
Check the certificate in use by the SHA agent on NSX Manager:
NSX version is equal to or higher than 4.2:
Get the root certificate and node certificate via command `/opt/vmware/nsx-netopa/bin/sha-appctl -c get_napp_certificates`.
Get the root certificate and node certificate by searching syslog via command `zgrep nsx-sha /var/log/syslog* | grep "NAPP Profile"`.
The certificate may be missing from the log if the connection log entries have been rotated. In that case, restart the SHA agent following the steps below.
Get API certificate:
Get the certificate ID from the NSX UI:
a. Log in to the NSX Manager UI and navigate to System > Certificates
b. Locate the API certificate for the node ID of the manager in question (find the Manager node ID by running get nodes in the Manager CLI, or in the NSX UI at System → Appliances → NSX Manager → VIEW DETAILS → Copy the UUID)
c. Expand this certificate item and note its UUID
Use API GET /api/v1/trust-management/certificates/[certificate ID from last step]
curl -k -i -H "Accept: application/json" -u admin -X GET https://<Manager IP>/api/v1/trust-management/certificates/<certificate ID>
{
"pem_encoded": "-----BEGIN CERTIFICATE-----XXXXXXXXXXXX-----END CERTIFICATE-----\n",
"has_private_key": true,
"used_by": [
{
"node_id": "<UUID>",
"service_types": [
"API"
]
}],
"leaf_certificate_sha_256_thumbprint": "XX:XX:XX:XX:XX:XX",
"resource_type": "certificate_self_signed",
"id": "<UUID>",
"display_name": "API certificate for node <UUID>",
"_create_time": 1690371634010,
"_create_user": "system",
"_last_modified_time": 1690377045873,
"_last_modified_user": "admin",
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 2
}
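To compare the API certificate with the one reported by the SHA agent, the pem_encoded value has to be turned back into PEM text. The following is a rough text-based sketch, assuming bash and a response shaped like the example above; extract_pem is a hypothetical helper name, and a real JSON parser would be the more robust choice where one is available.

```shell
#!/usr/bin/env bash
# extract_pem: read an API response on stdin and print the value of the
# "pem_encoded" field as PEM text, turning JSON "\n" escapes into newlines.
extract_pem() {
  local value
  value=$(sed -n 's/.*"pem_encoded"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
  # %b expands backslash escapes such as \n inside the argument.
  printf '%b' "$value"
}
```

For instance, the curl output above could be piped into `extract_pem > api-cert.pem`, where api-cert.pem is an illustrative file name.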
If the node certificate is different from the updated API certificate, restart SHA agent via command `service nsx-sha restart`.
If restarting SHA agent does not work, restart proton via command `service proton restart`.
Issue 4 - The TN certificate has changed after the SSP deployment:
Check the node certificate in use by the SHA agent on the TN:
NSX version is equal to or higher than 9.0:
TN:
/usr/lib/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp
Edge:
/opt/vmware/nsx-netopa/bin/sha-appctl -c get_collector_status --collector_type napp
NSX version is lower than 9.0:
Get the root certificate and node certificate by searching nsx-syslog:
Copy all of the nsx-syslog files to a temporary directory and unzip the compressed logs
Search the log to dump the SSP file via command `grep -ia nsx-sha nsx-syslog* | grep -ia "NAPP Profile"`
The certificate may be missing from the log if the connection log entries have been rotated. In that case, restart the SHA agent following the steps below.
Check the current node certificate on the TN:
cat /etc/vmware/nsx/host-cert.pem
If the node certificate used by SHA agent is different from the current effective certificate on the TN, restart the affected services:
SHA agent - /etc/init.d/netopad restart
Exporter - /etc/init.d/nsx-exporter restart
The last remediation step is to restart proton on the common agent leader node, which forces a full sync. Allow a few minutes for common agent to complete the full sync, then restart authserver on the SSP side. This resyncs the certificates from trust manager and should resolve the issue.
Restart the proton service on NSX:
Log on to the NSX Manager
systemctl restart proton
Restart authserver on the SSP side:
Log into SSPI root shell
k rollout restart deployment authserver -n nsxi-platform
- UNAVAILABLE or DEADLINE_EXCEEDED:
- If there is an additional firewall between the TN node (including Edge and ESXi nodes) and the NAPP platform, check whether any policy denies traffic from the TN to NAPP, since this traffic requires HTTPS (443). Refer to https://ports.broadcom.com/home/VmwarevDefend for details on the ports that must be open.
- Log in to the NSX Manager managing this node and get the profile using API /api/v1/infra/sites/napp/registration.
Example:
{
"napp_registration_results": [
{
"cluster_id": "4ad2272c-####-####-####-##########e1",
"cluster_name": "NSX Application Platform",
"message_bus_ip_address": "192.xx.xx.xx",
"ingress_ip_address": "napp.example.com",
"status": "DEPLOYMENT_SUCCESSFUL",
"is_intelligence_enabled": false,
"is_metric_enabled": true,
"resource_type": "NappRegistration",
"id": "4ad2272c-####-####-####-##########e1",
"display_name": "4ad2272c-####-####-####-##########e1",
"path": "/infra/settings/napp/napp-appliance-info/4ad2272c-####-####-####-##########e1",
"relative_path": "4ad2272c-####-####-####-##########e1",
"parent_path": "/infra",
"remote_path": "",
"unique_id": "4b5268b8-####-####-####-##########4e",
"realization_id": "4b5268b8-####-####-####-##########4e",
"owner_id": "2313c7d2-####-####-####-##########e5",
"marked_for_delete": false,
"overridden": false,
"_create_time": 1709187097362,
"_create_user": "nsx_policy",
"_last_modified_time": 1709187320744,
"_last_modified_user": "nsx_policy",
"_system_owned": false,
"_protection": "REQUIRE_OVERRIDE",
"_revision": 1
}
]
}
- Check that the "ingress_ip_address" field in the API response is the same as the SSP FQDN.
- Check and fix the connectivity from the "Reported by node" to the SSP FQDN.
- Check that, on the "Reported by node", the DNS lookup result for the "ingress_ip_address" value in the API response is the real Security Services Platform (SSP) address.
- If you are seeing many of these alarms from various Edge, ESX and UA nodes, check whether there is an alarm for "nsx_application_platform_communication.manager_disconnected". Follow the remediation for that alarm first.
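The connectivity and DNS checks in the list above can be sketched as two shell helpers. This is only a sketch, assuming a Linux shell with bash, coreutils timeout and getent (these may not be available on every node type, ESXi in particular); can_connect and resolve_host are hypothetical names, and napp.example.com is the placeholder FQDN from the API example.

```shell
#!/usr/bin/env bash
# can_connect: succeed when a TCP connection to host:port opens within
# the timeout (seconds, default 5). Uses bash's /dev/tcp redirection.
can_connect() {
  local host=$1 port=${2:-443} secs=${3:-5}
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# resolve_host: print the addresses the local resolver returns for a name.
# Prints nothing when the name does not resolve.
resolve_host() {
  getent hosts "$1" | awk '{print $1}'
}
```

For example, `resolve_host napp.example.com` prints the resolved address to compare against the real SSP address, and `can_connect napp.example.com 443 || echo "port 443 not reachable"` reports a blocked or failing path.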
- PERMISSION_DENIED:
- Check the API response from envoy log:
- Get the projectcontour pod name and get envoy log from this pod using following commands
- Log into SSPI root shell
- k get pods -n projectcontour
Example:
root@nsx-mgr-1:~# napp-k get pods -n projectcontour
NAME                                      READY   STATUS    RESTARTS   AGE
projectcontour-contour-77669b45bd-lqdx4   1/1     Running   0          9d
projectcontour-envoy-8qjqz                2/2     Running   0          9d
projectcontour-envoy-dgqc6                2/2     Running   0          9d
projectcontour-envoy-pssjw                2/2     Running   0          9d
projectcontour-envoy-pt9fk                2/2     Running   0          9d
projectcontour-envoy-qn79h                2/2     Running   0          9d
projectcontour-envoy-wchcd                2/2     Running   0          9d
- k logs {pod-name} -c envoy -n projectcontour
Example:
root@nsx-mgr-1:~# k logs projectcontour-envoy-8qjqz -c envoy -n projectcontour
- The log shows the response flag of each API call.
Example:
[2024-05-31T08:55:21.662Z] "POST /MetricsMgrGrpc/StatusMetricsHealthCheck HTTP/2" 200 UAEX 0 0 0 - "10.xx.xx.xx" "grpc-python/1.47.5 grpc-c/25.0.0 (linux; chttp2)" "57526c59-####-####-####-##########ae" "cloudnative.nsbucqesystem.net:443" "-"
- UAEX - stands for UnauthorizedExternalService
- If the API response flag is UAEX, a possible reason is that the auth-server pod has stopped. To confirm, check the auth-server pod status using the following command:
- k get pods -n nsxi-platform | grep auth
Example:
root@nsx-mgr-1:~# k get pods -n nsxi-platform | grep auth
authserver-6bc9b59ddd-sbc5g 1/1 Running 0 7d4h
- If the auth-server pod has stopped, reach out to support for further investigation.
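Envoy logs can be long, so it may help to filter them for the UAEX flag instead of reading them by eye. The following is a minimal sketch, assuming bash and that the access-log lines follow the layout of the example above, with the response flag directly after the HTTP status code; filter_flag is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# filter_flag: read Envoy access-log lines on stdin and print those whose
# response-flags field (the token after the HTTP status code) equals $1.
# The flag defaults to UAEX when no argument is given.
filter_flag() {
  awk -F'"' -v flag="${1:-UAEX}" '{
    # $3 is the text after the quoted request line, e.g. " 200 UAEX 0 0 0 - "
    n = split($3, f, " ")
    if (n >= 2 && f[2] == flag) print
  }'
}
```

A possible use is `k logs projectcontour-envoy-8qjqz -c envoy -n projectcontour | filter_flag UAEX`, repeated for each envoy pod.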