VMware NSX Application Platform (NAPP) deployment fails at 80% while deploying metrics charts

Article ID: 345843

Updated On:

Products

VMware NSX Networking

Issue/Introduction

Symptoms:

  • Deployment of the NSX Application Platform (NAPP) may fail at approximately 80% completion.
  • Check the status of the metrics pods by running the following command as root on the NSX Manager CLI:
  1. root@nsx-mgr-0:~# napp-k get pods | grep metrics
    metrics-app-server-5dd579b8db-nn5bz               1/1     Running  
    metrics-db-helper-558cb94688-gngv9                1/1     Running  
    metrics-manager-5f5d48f49b-g4dn6                  1/1     Running  
    metrics-manager-5f5d48f49b-npwl2                  1/1     Running  
    metrics-manager-create-kafka-topic-k9655          0/1     Completed
    metrics-nsx-config-5d6ccd7df8-2fz8b               1/1     Running  
    metrics-nsx-config-create-kafka-topic-bwnjg       0/1     Completed
    metrics-postgresql-ha-pgpool-795bf6b4c-6qzhj      1/1     Running  
    metrics-postgresql-ha-pgpool-795bf6b4c-ml5f6      1/1     Running  
    metrics-postgresql-ha-postgresql-0                0/1     Running  
    metrics-postgresql-ha-postgresql-1                1/1     Running  
    metrics-query-server-595b6d579d-ljtlx             1/1     Running
  2. If any pod is stuck in the CrashLoopBackOff state, check the symptoms listed below and apply the provided remediation. If all pods are either Running or Completed, you can re-attempt the deployment from the GUI.
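
If a pod is shown in the CrashLoopBackOff state, its recent events and the reason for the last restart can be inspected before applying a remediation. A minimal example using one of the pod names from the listing above (napp-k wraps kubectl on the NSX Manager, so standard sub-commands such as describe should work):

root@nsx-mgr-0:~# napp-k describe pod metrics-postgresql-ha-postgresql-0
# The Events section at the end of the output shows why the container was
# last restarted (for example, a failed liveness probe or OOMKilled).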



Environment

VMware NSX-T Data Center

Resolution

This is a known issue impacting VMware NSX-T Data Center.

Workaround:

1.1. metrics-postgresql-ha-postgresql pods are not in a running state:

  • Check the status of the postgresql pods by running the following command as root on the NSX Manager CLI:

root@nsx-mgr-0:~# napp-k get pods | grep metrics-postgresql-ha-postgresql

metrics-postgresql-ha-postgresql-0                0/1     CrashLoopBackOff  
metrics-postgresql-ha-postgresql-1                1/1     Running  

  • The example above shows that metrics-postgresql-ha-postgresql-0 has crashed and metrics-postgresql-ha-postgresql-1 is running. Both pods can also be in a CrashLoopBackOff state.
  • The CrashLoopBackOff state can be confirmed by checking the log using the following command: 
root@nsx-mgr-0:~# napp-k logs metrics-postgresql-ha-postgresql-0 
  • If there are no errors at the end of the log, it would indicate the liveness probe restarted the pod.
  • You will see entries similar to "Cloning data from primary node..." at the end of the log.
NOTE: The preceding log excerpts are only examples. Dates, times, and environment-specific values may vary depending on your environment.
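
If the pod has already been restarted by the liveness probe, the log of the previous container instance can also be reviewed; a minimal example, assuming napp-k passes standard kubectl flags such as --previous through:

root@nsx-mgr-0:~# napp-k logs --previous metrics-postgresql-ha-postgresql-0
# Shows the log of the last terminated container instance of the pod.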

1.1.1. Cause
The NAPP deployment is unable to complete because Postgres replication takes longer than expected. This causes the replica (slave) pods to be restarted by the liveness probe before the replication can complete.

1.1.2. Remediation
To remediate the issue, increase the timeout values for the liveness and readiness probes. This can be done by editing the StatefulSet with the following command, run as root on the NSX Manager CLI:
  • root@nsx-mgr-0:~# napp-k edit statefulset metrics-postgresql-ha-postgresql
Set the following probe values if they are not already set (the editor uses the same key bindings as vi/vim):
livenessProbe:
..
     initialDelaySeconds:   >> Set to 90
     periodSeconds:   >> Set to 60
     timeoutSeconds:   >> Set to 30
...
readinessProbe:
...     
     initialDelaySeconds:   >> Set to 30 
     periodSeconds:   >> Set to 60
     timeoutSeconds:    >> Set to 30
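
For reference, a minimal sketch of how the edited probe stanzas might look in the postgresql container spec of the StatefulSet; the probes' remaining fields (for example the exec command, failureThreshold and successThreshold) are left untouched and may differ by NAPP version:

# Sketch only - under spec.template.spec.containers[] for the postgresql container
livenessProbe:
  initialDelaySeconds: 90   # wait 90 seconds after container start before the first liveness check
  periodSeconds: 60         # run the liveness check every 60 seconds
  timeoutSeconds: 30        # allow each check up to 30 seconds before it counts as a failure
readinessProbe:
  initialDelaySeconds: 30   # wait 30 seconds after container start before the first readiness check
  periodSeconds: 60
  timeoutSeconds: 30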
 
Once set, metrics-postgresql-ha-postgresql-0 should return to a running state after a short period of time. Check the pod status with the command: napp-k get pods | grep metrics-postgresql-ha-postgresql
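
Optionally, the applied probe values can be verified without reopening the editor; a minimal check, assuming napp-k passes standard kubectl arguments through:

root@nsx-mgr-0:~# napp-k get statefulset metrics-postgresql-ha-postgresql -o yaml | grep -A 8 -E 'livenessProbe|readinessProbe'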
1.2. metrics-postgresql-ha-pgpool pods are not in a running state:
  • Before proceeding, make sure you have worked through section 1.1 and that the metrics-postgresql-ha-postgresql pods are in a running state.
  • Check the status of the pgpool pods by running the following command as root on the NSX Manager CLI:
root@nsx-mgr-0:~# napp-k get pods | grep metrics-postgresql-ha-pgpool
 
metrics-postgresql-ha-pgpool-795bf6b4c-6qzhj               0/1     CrashLoopBackOff  
metrics-postgresql-ha-pgpool-795bf6b4c-ml5f6               1/1     Running  
  • The example above shows that metrics-postgresql-ha-pgpool-795bf6b4c-6qzhj has crashed and metrics-postgresql-ha-pgpool-795bf6b4c-ml5f6 is running. Both pods can also be in a CrashLoopBackOff state.
1.2.1. Cause
The NAPP deployment is unable to complete due to a slow network or insufficient resources.

1.2.2. Remediation
To remediate the issue, increase the timeout values for the liveness and readiness probes. This can be done by editing the Deployment with the following command, run as root on the NSX Manager CLI:
  • root@nsx-mgr-0:~# napp-k edit deployment metrics-postgresql-ha-pgpool
Set the following probe values if they are not already set (the editor uses the same key bindings as vi/vim):
livenessProbe:
..
     initialDelaySeconds:   >> Set to 60
     failureThreshold:   >> Set to 10
...
readinessProbe:
...     
     initialDelaySeconds:   >> Set to 30 
     failureThreshold:   >> Set to 10
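
As above, a minimal sketch of how the edited probe stanzas might look in the pgpool container spec of the Deployment; the probes' other fields are left as they are:

# Sketch only - under spec.template.spec.containers[] for the pgpool container
livenessProbe:
  initialDelaySeconds: 60   # wait 60 seconds after container start before the first liveness check
  failureThreshold: 10      # restart the container only after 10 consecutive failed liveness checks
readinessProbe:
  initialDelaySeconds: 30   # wait 30 seconds after container start before the first readiness check
  failureThreshold: 10      # mark the pod unready only after 10 consecutive failed readiness checks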

Once set, the metrics-postgresql-ha-pgpool pods should return to a running state after a short period of time. Check the pod status with the command: napp-k get pods | grep metrics-postgresql-ha-pgpool
1.3. metrics-manager pods are not in a running state:
  • Before proceeding, make sure you have worked through section 1.2 and that the metrics-postgresql-ha-pgpool pods are in a running state.
  • Check the status of the metrics-manager pods by running the following command as root on the NSX Manager CLI:
root@nsx-mgr-0:~# napp-k get pods | grep metrics-manager
metrics-manager-5f5d48f49b-g4dn6               0/1     CrashLoopBackOff
metrics-manager-5f5d48f49b-npwl2               1/1     Running
  • The example above shows that metrics-manager-5f5d48f49b-g4dn6 has crashed and metrics-manager-5f5d48f49b-npwl2 is running. Both pods can also be in a CrashLoopBackOff state.
1.3.1. Cause
The NAPP deployment is unable to complete due to a slow network or insufficient resources.

1.3.2. Remediation
To remediate the issue, increase the failureThreshold value for the startup probe. This can be done by editing the Deployment with the following command, run as root on the NSX Manager CLI:
root@nsx-mgr-0:~# napp-k edit deployment metrics-manager
Set the following field to the specified value:
..
startupProbe:
  ...
  failureThreshold: >> Set to 75
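
A minimal sketch of the resulting startupProbe stanza in the metrics-manager container spec; the probe's other fields (for example periodSeconds and the check itself) are unchanged:

# Sketch only - under spec.template.spec.containers[] for the metrics-manager container
startupProbe:
  failureThreshold: 75   # tolerate up to 75 failed startup checks before the container is restarted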

Once set, the metrics-manager pods should return to a running state after a short period of time. Check the pod status by running the command: napp-k get pods | grep metrics-manager

1.4. metrics-db-helper pod is not in a running state:
  • Before proceeding, make sure you have worked through section 1.3 and that the metrics-manager pods are in a running state.
  • Check the status of the metrics-db-helper pod by running the following command as root on the NSX Manager CLI:
root@nsx-mgr-0:~# napp-k get pods | grep metrics-db-helper
metrics-db-helper-558cb94688-gngv9             0/1    CrashLoopBackOff  
  • The example above shows that metrics-db-helper-558cb94688-gngv9 has crashed.
1.4.1. Cause
The NAPP deployment is unable to complete due to a slow network or insufficient resources.

1.4.2. Remediation
To remediate the issue, increase the failureThreshold value for the startup probe. This can be done by editing the Deployment with the following command, run as root on the NSX Manager CLI:
  • root@nsx-mgr-0:~# napp-k edit deployment metrics-db-helper
Set the following field to the specified value:
..
startupProbe:
  ...
  failureThreshold: >> Set to 75
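
The resulting startupProbe stanza is analogous to the metrics-manager one; a minimal sketch:

# Sketch only - under spec.template.spec.containers[] for the metrics-db-helper container
startupProbe:
  failureThreshold: 75   # tolerate up to 75 failed startup checks before the container is restarted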

Once set, the metrics-db-helper pod should return to a running state after a short period of time. Check the pod status by running the command: napp-k get pods | grep metrics-db-helper
NOTE: The preceding command outputs and log excerpts are only examples. Dates, times, and environment-specific values may vary depending on your environment.