VMware NSX Application Platform (NAPP) deployment stuck at 70% registering platform

Products

VMware NSX

Issue/Introduction

Symptoms:

You are using NSX-T
During the installation of NSX Application Platform (NAPP), the deployment fails at 70% with the errors below:

In /var/log/proton/napps.log in the NSX Manager support bundle, errors similar to the following 504 error appear:

2023-03-09 13:53:28 INFO api_request:115 [MainThread] - GET:/napp/api/v1/platform/monitor/platform/status
2023-03-09 13:53:43 INFO api_request:120 [MainThread] - b'upstream request timeout'
2023-03-09 13:53:43 ERROR api_request:133 [MainThread] - Unexpected error for GET /napp/api/v1/platform/monitor/platform/status, status: 504, body: b'upstream request timeout

In /var/log/proton/nsxapi.log in the NSX Manager support bundle, errors show that NAPP registration failed:

2023-02-01T16:15:00.708Z INFO http-nio-127.0.0.1-7440-exec-23 CloudNativePlatformFacadeImpl 11508 NAPP [nsx@6876 comp="nsx-manager" level="INFO" reqId="e681c515-####-####-####-########885" subcomp="manager" username="nsx-opsagent"] Get PlatformDeploymentProgress: DeploymentProgressStatusDto{overallStatus='DEPLOYMENT_FAILED', percentage='70', progressMessage='Registering Platform', errorMessage='[NSX Application Platform registration failed.]'}

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
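
Both symptoms can be checked in one pass. The following is a minimal sketch, not a supported tool: the function name check_napp_symptoms and the argument for the extracted support bundle root are illustrative assumptions, and the search patterns are taken from the example log lines above.

```shell
# Hypothetical helper: look for the two symptoms described above.
# $1 = root directory of an extracted support bundle (assumed layout:
#      <root>/var/log/proton/...). Leave it empty to read the live
#      /var/log/proton files on the NSX Manager itself.
check_napp_symptoms() {
    napp_root="${1:-}"
    napp_found=1
    if grep -q "status: 504" "$napp_root/var/log/proton/napps.log" 2>/dev/null; then
        echo "napps.log: 504 upstream request timeout seen"
        napp_found=0
    fi
    if grep -q "NSX Application Platform registration failed" \
            "$napp_root/var/log/proton/nsxapi.log" 2>/dev/null; then
        echo "nsxapi.log: NAPP registration failure seen"
        napp_found=0
    fi
    # Returns 0 if at least one symptom was found, 1 otherwise.
    return "$napp_found"
}
```

For example, `check_napp_symptoms /tmp/bundle` would inspect a bundle extracted to /tmp/bundle (a placeholder path).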

Environment

VMware NSX-T Data Center

Cause

The issue can occur due to a connectivity issue between the NSX Manager and the pods.
To confirm connectivity, test the following:

Confirm NSX to K8 cluster connectivity:

1. Get the external ingress IP of the k8s cluster by running the below command from the root CLI of the NSX Manager:

# napp-k get svc -n projectcontour
# This should display the following output:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
projectcontour ClusterIP 192.x.x.x <none> 8001/TCP
projectcontour-envoy LoadBalancer 192.x.x.x 10.x.x.x 80:31434/TCP,443:31873/TCP

2. Ensure that the external IP address (10.x.x.x in the above example) is reachable from the manager node:

# openssl s_client -debug -connect 10.x.x.x:443
connect: Connection timed out
connect:errno=110

3. If the connection times out as shown above, there is an issue in the k8s network infrastructure between the NSX Manager and the ingress IP.
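
The steps above can be combined into a short sketch. The function names envoy_external_ip and probe_ingress are illustrative, and the sketch assumes that awk and the coreutils timeout command are available in the NSX Manager root shell. The probe reuses the openssl command from step 2 but bounds it to 10 seconds instead of waiting for the default TCP timeout.

```shell
# Print the EXTERNAL-IP column of the LoadBalancer row, reading the
# output of `napp-k get svc -n projectcontour` on stdin.
envoy_external_ip() {
    awk '$2 == "LoadBalancer" { print $4 }'
}

# Probe port 443 on the given IP with openssl, giving up after 10 seconds.
# A failed TLS handshake is reported the same way as a failed TCP connect.
probe_ingress() {
    if timeout 10 openssl s_client -connect "$1:443" </dev/null >/dev/null 2>&1; then
        echo "reachable: $1:443"
    else
        echo "connect or TLS handshake failed: $1:443"
        return 1
    fi
}

# Usage from the NSX Manager root shell:
#   ip=$(napp-k get svc -n projectcontour | envoy_external_ip)
#   probe_ingress "$ip"
```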

Check cluster-api to NSX Manager connectivity:
- In the cluster-api log, connection timed out errors like the following appear:

{"time":"2023-02-01T16:12:37.08024686Z","level":"ERROR","prefix":"-","file":"service.go","line":"426","message":"Fetching NSX config for populating intelligence default config failed: Unable to fetch platform deployment config: Get \"https://<nsx-manager>/policy/api/v1/infra/sites/default/napp/deployment/platform\": dial tcp 10.x.x.x:443: connect: connection timed out"}
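
If the cluster-api log is long, a filter along these lines can isolate the relevant entries. This is a sketch only: the function name filter_connect_timeouts is illustrative, the patterns come from the example entry above, and how the log is obtained is left open. The `napp-k logs` call in the usage comment is an assumption that napp-k accepts kubectl-style subcommands, as it does for `get` and `exec`.

```shell
# Keep only ERROR-level log lines that report a connect timeout.
# Reads cluster-api log lines (one JSON object per line) on stdin.
filter_connect_timeouts() {
    grep '"level":"ERROR"' | grep 'connection timed out'
}

# Usage (assumed, not confirmed by this article):
#   napp-k logs <cluster-api-pod> -c cluster-api | filter_connect_timeouts
```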

"nsx-manager" is a service in k8s that proxies calls to the Policy Manager. Check for connectivity issues from the cluster-api pod by running this command from the NSX Manager shell:

napp-k exec -it `napp-k get pods | grep cluster | cut -d ' ' -f 1` -c cluster-api -- sh -c "curl https://<nsx-manager>/policy/api/v1/infra/sites/default/napp/deployment/platform --cert /certs/egress-tls.crt --key /certs/egress-tls.key -k"

* Trying 10.x.x.x...
* TCP_NODELAY set
* connect to 10.x.x.x port 443 failed: Connection timed out
* Failed to connect to <nsx manager> port 443: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to <nsx manager> port 443: Connection timed out
command terminated with exit code 7

In this example we confirmed the connection timed out.
Investigate why these components are unable to communicate (firewall, physical networking etc).
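
The exit code reported by the command above ("command terminated with exit code 7") is curl's own exit code, which narrows down where to look. The helper below is a hedged sketch: the function name explain_curl_exit is illustrative, the code meanings are those documented in the curl man page, and the suggested next steps are general hints rather than NSX-specific guidance.

```shell
# Map a curl exit code to a short troubleshooting hint.
explain_curl_exit() {
    case "$1" in
        0)  echo "transfer completed: TCP/TLS connectivity works (HTTP status not checked)" ;;
        6)  echo "could not resolve host: check name resolution inside the cluster" ;;
        7)  echo "failed to connect: check firewalls and routing to port 443" ;;
        28) echo "operation timed out: check for dropped packets or slow paths" ;;
        35) echo "TLS connect error: check certificates and TLS settings" ;;
        *)  echo "curl exit code $1: see EXIT CODES in the curl man page" ;;
    esac
}

# Usage: run the napp-k exec command above, then:
#   explain_curl_exit "$?"
```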

Resolution

The product is working as expected. This is a connectivity issue between the NSX manager and the K8 pods.