NAPP 4.2.0 Upgrade Failure During Metrics Feature Upgrade - Site-Service DB Table Not Created During Upgrade

Article ID: 373017

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

During a NAPP upgrade to 4.2.0, the upgrade of the Metrics feature fails because the metrics-nsx-config component does not come up. If the Metrics upgrade fails with different symptoms, the cause may be different and this article may not apply.

To confirm that this article applies, perform the following checks.

Log in to the NSX Manager and list the metrics-nsx-config pods with the napp-k command:

napp-k get pods | grep metrics-nsx-config
nsxi-platform metrics-nsx-config-5b997bb7c9-pwsbq 0/1 CrashLoopBackOff 25 (4m29s ago) 125m
nsxi-platform metrics-nsx-config-674fb6876d-dv4dg 1/1 Running 0 3h48m

Check the logs of the failing pod (i.e. in CrashLoopBackOff):

napp-k logs metrics-nsx-config-5b997bb7c9-pwsbq | grep 'Caused by: feign.FeignException: status 500 reading SiteServiceApi'
Caused by: feign.FeignException: status 500 reading SiteServiceApi#getAllSites() <<- indicates issue is with site-service

Additionally, check the site-service logs:

napp-k logs deploy/site-service | grep 'Error migrating DgsSites table'
2024-07-20T08:37:31.45241032Z stdout F 2024-07-20T08:37:31.452Z    ERROR    repository/repository.go:14    Error migrating DgsSites table    > {"error": "FATAL: terminating connection due to administrator command (SQLSTATE 57P01)"}

If no matches are found in the logs, then the issue is likely something else.
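
The checks above can be wrapped in two small filters so that the result is unambiguous. This is a minimal sketch and not part of the KB: the function names crashloop_pods and has_site_service_error are hypothetical, and the column positions assume the pod listing has the layout shown in the sample output above (namespace, pod name, ready count, status).

```shell
#!/bin/sh
# Hypothetical helper: print the names of metrics-nsx-config pods that are
# in CrashLoopBackOff. Reads a pod listing on stdin; assumes column 2 is
# the pod name and column 4 is the status, as in the sample output.
crashloop_pods() {
    awk '$2 ~ /^metrics-nsx-config-/ && $4 == "CrashLoopBackOff" { print $2 }'
}

# Hypothetical helper: succeed (exit 0) if the log text on stdin contains
# the FeignException signature quoted in this article.
has_site_service_error() {
    grep -qF 'Caused by: feign.FeignException: status 500 reading SiteServiceApi'
}

# Example use on the NSX Manager (not executed here):
#   napp-k get pods | crashloop_pods
#   napp-k logs <pod-name> | has_site_service_error && echo "signature found"
```

If crashloop_pods prints nothing, or has_site_service_error fails for the failing pod, the symptoms do not match this article.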

Environment

NAPP 4.2.0

Cause

The site-service component waits until postgres is running before it starts. On startup, it connects to postgres and then creates or migrates its table in the configuration database. However, if the migration fails (for example, because postgres restarts during the migration), it is not retried; site-service must be restarted. Because the table is not created in postgres, downstream applications that rely on this table fail.

Resolution

Log in to the NSX Manager. All commands in the following steps are run from the NSX Manager.

Verify postgres pods are up and running:

napp-k get pods | grep postgres
 
nsxi-platform        postgresql-ha-pgpool-7f6c57ffc5-c6vqb     1/1     Running     0     142m  
nsxi-platform        postgresql-ha-pgpool-7f6c57ffc5-dm965     1/1     Running     0     143m  
nsxi-platform        postgresql-ha-postgresql-0                1/1     Running     0     143m

All three pods in the above output should show a status similar to: '1/1 Running'
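
The readiness check can also be expressed as a filter that fails when any postgres pod is not ready. This is a sketch under assumptions, not part of the KB: the function name postgres_all_running is hypothetical, and it assumes the same column layout as the sample output above (pod name in column 2, ready count in column 3, status in column 4).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if at least one postgres pod is listed
# on stdin and every listed postgres pod shows "1/1 Running".
postgres_all_running() {
    awk '
        BEGIN { seen = 0; bad = 0 }
        $2 ~ /postgres/ {
            seen++
            if ($3 != "1/1" || $4 != "Running") bad++
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }
    '
}

# Example use on the NSX Manager (not executed here):
#   napp-k get pods | postgres_all_running || echo "postgres is not ready yet"
```

An empty listing is treated as a failure, so a mistyped filter or missing pods does not read as "healthy".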

Copy the script attached to this KB to the NSX Manager. It must not be stored in the /tmp directory. The following steps assume the script is named kb_upgrade_fix.sh
Save the output of the following command to a file as a precautionary backup: napp-k get job load-default-site -o yaml > load-default-site-backup.yaml
Make the script executable: chmod +x kb_upgrade_fix.sh
Run the script: ./kb_upgrade_fix.sh. If everything succeeds, the script prints "script executed successfully, please restart pods and retry upgrade"
Restart the failing pod: napp-k rollout restart deploy/metrics-nsx-config
You should now be able to complete the upgrade.
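
To avoid restarting the pod after a failed run, the script output can be checked for the success message before proceeding. This is a minimal sketch, not part of the KB: the function name script_succeeded and the log file name kb_upgrade_fix.log are hypothetical, and capturing both stdout and stderr is an assumption, since the article does not say which stream the message is written to.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the text on stdin contains the success
# message that the article says kb_upgrade_fix.sh prints.
script_succeeded() {
    grep -qF 'script executed successfully, please restart pods and retry upgrade'
}

# Example use on the NSX Manager (not executed here):
#   ./kb_upgrade_fix.sh > kb_upgrade_fix.log 2>&1
#   script_succeeded < kb_upgrade_fix.log \
#       && napp-k rollout restart deploy/metrics-nsx-config
```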

Attachments

kb_upgrade_fix.sh
