NSX Manager UI is inaccessible showing "This site can't be reached" after rebooting or restarting services.

Products

VMware NSX

Issue/Introduction

Symptoms:

A VMware NSX 4.x version earlier than 4.1.2.x or 4.2.x is installed.
Certificate(s) of type CLIENT_AUTH are present on the NSX Manager.
The UI for an NSX Manager node does not open, and the browser reports "This site can't be reached" with ERR_CONNECTION_CLOSED or a similar error.
Standard API requests to an NSX Manager node or the VIP are failing.
In the NSX Manager node log /var/log/proxy/envoy.log, a warning similar to the following appears at the end:

[warning][config] [source/common/config/filesystem_subscription_impl.cc:43] Filesystem config update rejected: Error adding/updating listener(s) https-node-v4-local: Failed to load trusted CA certificates from <inline>
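
The log check above can be scripted. The following is a minimal sketch, assuming a bash shell on the NSX Manager node; check_envoy_rejection is a hypothetical helper name, not an NSX command, and the search string is taken from the warning shown above.

```shell
#!/usr/bin/env bash
# Sketch only: check_envoy_rejection is a hypothetical helper, not an NSX command.
# It prints the most recent matching warning and returns 0 if one is found,
# or returns 1 if the log contains no such warning.
check_envoy_rejection() {
    local log_file="${1:-/var/log/proxy/envoy.log}"
    local last_match
    # -F treats the pattern as a fixed string, so no regex escaping is needed.
    last_match=$(grep -F "Failed to load trusted CA certificates" "$log_file" | tail -n 1) || true
    if [ -n "$last_match" ]; then
        printf '%s\n' "$last_match"
        return 0
    fi
    return 1
}
```

Calling check_envoy_rejection with no argument reads /var/log/proxy/envoy.log; a different log path can be passed as the first argument.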

Other potential things to look for:

Is it a Federated environment? VMware NSX Federation creates a PI (principal identity) account for connections between sites, and these connections use CLIENT_AUTH certificates.
Are there other integrations that use CLIENT_AUTH certificates, such as Tanzu Kubernetes?
During an NSX upgrade, the Management Plane may get stuck at 11% and stop progressing.

Environment

VMware NSX 4.x

Cause

A proxy management service in NSX fails to parse a certificate properly if the certificate length is a multiple of 253 characters, for example 1012 characters. This is an underlying issue with the Envoy reverse proxy used on NSX Manager since VMware NSX 4.0.0.
Refer to Known Issue 3233914 in the VMware NSX 4.1.1 Release Notes.

Resolution

This issue is resolved in VMware NSX 4.1.2.x and 4.2.0.x and higher. The workaround below provides a temporary solution until the environment can be upgraded.

Workaround

You can recover UI access by temporarily removing the /home/secureall/secureall/.store/.client_truststore file. This clears the certificates currently loaded in /config/envoy/dynamic_listener_resources.json and quickly restores UI access, but the issue will recur after the next reboot or service restart. Run the following process from the NSX Manager CLI as the root user (run st en first if logged in as admin):

1. Run ls -lah /home/secureall/secureall/.store/.client_truststore and note the current ownership and permissions of the file
2. Rename the .client_truststore file by appending .bak, which creates a backup and removes the original:
mv /home/secureall/secureall/.store/.client_truststore /home/secureall/secureall/.store/.client_truststore.bak
*Note that the ownership of the .bak file changes to root:root at this point.
3. Confirm that the UI now loads when connecting directly to the same node. You can also test an API request, for example with Postman, at this point.
4. Once confirmed, restore the .client_truststore file:
cp /home/secureall/secureall/.store/.client_truststore.bak /home/secureall/secureall/.store/.client_truststore
5. Set the ownership (chown) of the .client_truststore file so that it matches the original noted in step 1:
chown <ownername>:<groupname> /home/secureall/secureall/.store/.client_truststore
*Note that the permissions should not have changed from rw-r-----. If needed, set them with:
chmod 640 /home/secureall/secureall/.store/.client_truststore
6. Verify that the UI and API still work for the node. At this point, external API requests should work as well.
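
Steps 1, 2, 4 and 5 above can be sketched as two bash functions, so that the original ownership and permissions are recorded rather than noted by hand. This is a rough sketch under stated assumptions: the function names and the META_FILE location are hypothetical, stat -c requires GNU coreutils, and the UI and API checks in steps 3 and 6 remain manual.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and META_FILE are assumptions, not part of NSX.
TRUSTSTORE="${TRUSTSTORE:-/home/secureall/secureall/.store/.client_truststore}"
META_FILE="${META_FILE:-/root/client_truststore.meta}"

# Steps 1-2: record numeric owner:group and octal mode, then move the file aside.
move_truststore_aside() {
    stat -c '%u:%g %a' "$TRUSTSTORE" > "$META_FILE" || return 1
    mv "$TRUSTSTORE" "${TRUSTSTORE}.bak"
}

# Steps 4-5: copy the backup into place, then reapply the recorded owner and mode.
restore_truststore() {
    local owner mode
    read -r owner mode < "$META_FILE" || return 1
    cp "${TRUSTSTORE}.bak" "$TRUSTSTORE" || return 1
    chown "$owner" "$TRUSTSTORE" || return 1
    chmod "$mode" "$TRUSTSTORE"
}
```

A typical sequence would be to run move_truststore_aside, confirm UI access as in step 3, run restore_truststore, and then verify again as in step 6.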

For situations where CLIENT_AUTH certificates are not in use and can be removed, the following steps are available to prevent recurrence:

Identify any certificate that is preventing the reverse proxy service from starting. If the API is working on any manager node in the cluster, you can use the following API to retrieve all certificates: GET /api/v1/trust-management/certificates.
*Alternatively, if standard API requests won't work, SSH to an NSX Manager node and enter engineering mode by running st en. Then, use this command from the CLI:
curl -H "x-nsx-username: admin" -X GET http://127.0.0.1:7440/nsxapi/api/v1/trust-management/certificates
Collect information about any certificates with the service type CLIENT_AUTH and with no nodes listed in the "Used by" section (nothing should appear between the [ and ] symbols).
*If the certificate is in use, do not proceed. Open a support case and reference this KB article.
Check the length of any CLIENT_AUTH certificate, excluding "-----BEGIN CERTIFICATE-----" and "-----END CERTIFICATE-----", counting all characters in between these headers, except for newline characters, which appear as "\n".
Remove any certificate whose character count is a multiple of 253. You can use the following API as the root user on the NSX Manager to delete the certificate:
curl -H "x-nsx-username: admin" -X DELETE http://127.0.0.1:7440/nsxapi/api/v1/trust-management/certificates/<cert-id>
*Where '<cert-id>' is the ID of a certificate that has a length that is a multiple of 253 and shows a service type of CLIENT_AUTH.
After removing the certificate(s), the envoy service should be able to restart without running into the issue.
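
The character count described above can be checked with a short script instead of counting by hand. The following is a minimal sketch, assuming the certificate text has been saved to a file, either as normal PEM lines or as the single-line form from the API output with literal "\n" sequences; pem_body_length and is_multiple_of_253 are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical names, not NSX commands.

# Print the number of characters between the BEGIN and END markers,
# ignoring real newlines, literal "\n" sequences, and other whitespace.
pem_body_length() {
    local count
    count=$(sed -e 's/\\n//g' \
                -e 's/-----BEGIN CERTIFICATE-----//' \
                -e 's/-----END CERTIFICATE-----//' "$1" \
            | tr -d ' \t\r\n' | wc -c)
    echo $((count))
}

# Return 0 if the body length is a non-zero multiple of 253, otherwise 1.
is_multiple_of_253() {
    local len
    len=$(pem_body_length "$1")
    [ "$len" -gt 0 ] && [ $((len % 253)) -eq 0 ]
}
```

For example, a certificate body of 1012 characters is flagged because 1012 = 4 x 253. The service type and "Used by" checks described above still need to be made separately before any certificate is removed.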

Note: If you are using Federation and the certificate is assigned to a PI account used by one of the sites, do not use the delete API above. Follow the administration guide to replace the site certificate; this automatically updates the certificate used by the PI for that site.

Additional Information

Certificates of type CLIENT_AUTH may actually be in use because of an integration such as Tanzu Kubernetes, or may be stuck because NSX Manager failed to release the certificate automatically. Manually releasing a certificate is not a safe procedure, and Engineering should be engaged before doing so. Where the certificates are actually needed, as with Kubernetes, do not remove them from NSX Manager. In that case, upgrade to a fixed version and use the workaround above in the meantime.