Issue 1. NSX_UA_TN is missing in authserver. The NAPP authentication server is missing an entity ID for the transport node (TN), in this case an ESXi host. Because of the missing ID, the metric service cannot authenticate and therefore fails to deliver metrics.
Issue 2. The problem occurs for every new Edge or ESXi host that is added to NSX after the NAPP deployment.
Issue 3. The API certificate on NSX Manager has changed after the NAPP deployment.
This is a known issue impacting VMware NSX NAPP.
Workaround:
The certificates that authserver loads from trust manager after a restart can be found by filtering the authserver pod logs with the following commands.
Example commands:
napp-k logs authserver-<podname> | grep "NSX_UA_TN"
napp-k logs authserver-<podname> | grep "NSX_UA_EDGE"
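The two grep commands above can also be combined into one pass over the log output. The following is a minimal sketch, assuming the pod logs have already been captured as text (for example by redirecting the napp-k logs output to a file). The function name is hypothetical, and because the article does not specify the log line format, the code only does plain substring matching, exactly as grep does.

```python
ENTITY_TYPES = ("NSX_UA_TN", "NSX_UA_EDGE")


def group_log_lines_by_entity(log_text, entity_types=ENTITY_TYPES):
    """Return {entity_type: [log lines containing that entity type]}.

    Equivalent to running `grep "<entity_type>"` once per entity type.
    """
    matches = {entity: [] for entity in entity_types}
    for line in log_text.splitlines():
        for entity in entity_types:
            if entity in line:
                matches[entity].append(line)
    return matches


# Usage (hypothetical file name holding the captured authserver logs):
#   with open("authserver.log") as handle:
#       grouped = group_log_lines_by_entity(handle.read())
#   print({entity: len(lines) for entity, lines in grouped.items()})
```

An empty list for NSX_UA_TN or NSX_UA_EDGE means no log line mentions that entity type, which is the same signal as an empty grep result.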
The certificates present in trust manager can be queried with the following API. A certificate of type NSX_UA_TN or NSX_UA_EDGE should be present in the result of the GET call. In the result, the alias field holds the UUID of the TN/Edge node, which can be obtained by running get node-uuid on the node.
GET https://<NSX_MANAGER_IP>/napp/api/v1/platform/trust-management/certificates
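As a sketch of how this GET call could be scripted, the snippet below builds the request with Python's standard library. It assumes the endpoint accepts HTTP basic authentication with NSX Manager credentials, which the article does not state. The function name and the credentials are placeholders.

```python
import base64
import urllib.request

# Path taken from the GET call above.
CERTS_PATH = "/napp/api/v1/platform/trust-management/certificates"


def build_certificates_request(manager_ip, username, password):
    """Build (but do not send) the GET request for the trust manager certificates.

    Assumption: basic authentication is accepted by this endpoint.
    """
    url = f"https://{manager_ip}{CERTS_PATH}"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


# Sending the request would look like this (needs network access and a
# suitable ssl.SSLContext for the manager's certificate):
#   with urllib.request.urlopen(request, context=ssl_context) as response:
#       body = json.load(response)
```

Building the request separately from sending it makes it possible to inspect the URL and headers before anything goes over the network.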
Example TN node certificate returned by the trust manager GET certificates API call:
{
  "uuid": "45040503-xxxx-xxxx-xxxx-xxxxxxxx",
  "alias": "0af35bd1-xxxx-xxxx-xxxx-xxxxxxxx",
  "pem_encoded": "-----BEGIN CERTIFICATE-----\nMIIEEDCCAvgCCQC...sFADCXXXXXXshCSk\n-----END CERTIFICATE-----",
  "used_by": "NSX_UA_TN"
},
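To locate the entry for a specific node in a long result, the certificate entries can be filtered on the used_by and alias fields. This is a minimal sketch, assuming the response body has already been parsed into a list of dictionaries shaped like the example entry above. The key that wraps the list in the response is not shown in this article, so extracting the list is left to the caller. The function name is hypothetical.

```python
def find_node_certificates(certificates, node_uuid,
                           used_by=("NSX_UA_TN", "NSX_UA_EDGE")):
    """Return the entries whose alias equals node_uuid and whose used_by is allowed.

    `certificates` is a list of dicts with the fields shown in the example
    entry (uuid, alias, pem_encoded, used_by). The comparison ignores case
    and surrounding whitespace in the UUID.
    """
    wanted = node_uuid.strip().lower()
    return [
        entry for entry in certificates
        if entry.get("used_by") in used_by
        and str(entry.get("alias", "")).strip().lower() == wanted
    ]
```

An empty result for the UUID reported by get node-uuid indicates that trust manager holds no TN/Edge certificate for that node.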
The following command can be run on the TN/Edge node to display the host certificate, which must match the corresponding certificate in NAPP trust manager.
cat /etc/vmware/nsx/host-cert.pem
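Comparing the two certificates by eye is error-prone because the host file uses real line breaks while the API returns the PEM as a single JSON string. The sketch below compares only the base64 body of the first certificate block in each input and ignores all whitespace. The function names are hypothetical, and it assumes the pem_encoded value has already been JSON-decoded so that \n sequences are real newlines.

```python
import re

_PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def pem_body(pem_text):
    """Return the base64 body of the first certificate block, whitespace removed."""
    match = _PEM_BLOCK.search(pem_text)
    if match is None:
        raise ValueError("no PEM certificate block found")
    return "".join(match.group(1).split())


def same_certificate(host_cert_pem, trust_manager_pem):
    """True if both PEM texts carry the same certificate body."""
    return pem_body(host_cert_pem) == pem_body(trust_manager_pem)


# Usage: pass the contents of /etc/vmware/nsx/host-cert.pem and the
# pem_encoded value of the matching trust manager entry.
```

Only the first certificate block in each input is compared. If the host file contained a chain, the remaining blocks would be ignored.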
The UUID of the TN node can be obtained with the CLI command get node-uuid. It should match the alias of the TN/Edge node in the trust manager GET certificates API result.
Example:
On the TN node:
>> /bin/nsxcli -c get node-uuid
On the Edge node:
>> su admin
>> get node-uuid
The TN certificates present in NSX Manager can be queried with the following API.
GET "https://<NSX_MANAGER_IP>/api/v1/messaging/clients"
Common agent is the service that pushes the certificates into trust manager on the NAPP side. To check whether common agent has synced properly, review /var/log/proton/nsxapi.log around the time the TN/Edge node was added.
Steps to identify the leader node of the common agent service:
Determine which of the 3 manager nodes holds the common agent leadership role. The following command shows the node ID of the common agent leader.
1. su admin -c get clus stat verb | grep "COMMON_AGENT_SERVICE"
2. To determine which node the ID belongs to, run: su admin -c get clus stat
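If NSX_UA_TN is missing from the authserver configuration (Issue 1), edit the authserver deployment:
napp-k edit deployment authserver
Change the following argument:
--trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE
to:
--trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN
Note: There should be 3 entity IDs: NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN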
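The check described in the note can be expressed as a small helper. This is an illustrative sketch only: the function name is hypothetical, and it expects the --trustmanager-entities argument as a single string copied from the deployment. How that argument is laid out in the deployment manifest is not covered in this article.

```python
REQUIRED_ENTITIES = ("NSX_UA_SVM", "NSX_UA_EDGE", "NSX_UA_TN")
FLAG = "--trustmanager-entities="


def missing_entities(argument, required=REQUIRED_ENTITIES):
    """Return the required entity IDs that the argument does not list."""
    argument = argument.strip()
    if not argument.startswith(FLAG):
        raise ValueError(f"argument does not start with {FLAG!r}")
    # Split the comma-separated value and tolerate spaces after commas.
    configured = {
        item.strip() for item in argument[len(FLAG):].split(",") if item.strip()
    }
    return [entity for entity in required if entity not in configured]
```

For the two-entity argument shown above the helper reports NSX_UA_TN as missing, and for the corrected argument it returns an empty list.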
napp-k -n nsxi-platform edit deployment authserver
(If the napp-k alias is not functional, you can point directly to the Kubernetes config file, for example: kubectl --kubeconfig /config/vmware/napps/.kube/config)
With this change, the authserver should restart and sync the entity certificates from trust manager, which should resolve the authentication issue on the NAPP side. To validate, check that the certificates in trust manager and authserver are in sync.
Validate that the Edge/TN node certificate (/etc/vmware/nsx/host-cert.pem) is present among the NAPP trust manager certificates (GET certificates API). Once that is confirmed, restart the authserver pod so that it reloads the certificates.
The last remediation step is to restart proton on the common agent leader node, which forces a full sync. Allow a few minutes for common agent to complete the full sync, then restart authserver on the NAPP side. This resyncs the certificates from trust manager and resolves the issue.
Restart proton: systemctl restart proton
Restart authserver on the NAPP side: napp-k delete pod authserver-<podname>