The upgrade starts normally and aborts after the controller node reboot step and initial service start. The GUI and CLI are unavailable for approximately 20 minutes, which is the timeout for cluster quorum convergence.
In the upgrade coordinator log, the last task recorded before the abort operation is "Waiting for cluster to be ready".
/var/lib/avi/log/upgrade-coordinator.log
[2025-12-08 17:39:52,937] INFO [upgrade_task_mgr.run:] Running task WaitUntilClusterReadyLocally
[2025-12-08 17:39:52,943] INFO [upgrade_tasks.run:] UC::[Mon Dec 8 17:39:52 2025]Waiting for cluster to be ready.::UC
[2025-12-08 17:59:54,064] ERROR [upgrade_tasks.start:] UC::[Mon Dec 8 17:59:54 2025]Error while running task:WaitUntilClusterReadyLocally
Timeout: waited 1200 sec. and the cluster is still not active.
Traceback (most recent call last):
--SNIP--
avi.util.exceptions.TimeoutError: Timeout: waited 1200 sec. and the cluster is still not active.
[2025-12-08 18:00:04,140] INFO [upgrade_task_mgr.log_upgrade_details:] Running Upgrade Operation: UPGRADEAbort
[2025-12-08 18:00:04,156] INFO [upgrade_tasks.run:] UC::[Mon Dec 8 18:00:04 2025]Marked upgrade request to UPGRADE_FSM_ABORT_IN_PROGRESS.::UC
[2025-12-08 18:00:04,166] INFO [upgrade_task_mgr.run:] Running completed for task MarkUpgradeState_UPGRADE_FSM_ABORT_IN_PROGRESS with message: Success
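To confirm this failure signature, you can grep the coordinator log for the timeout message. The sketch below runs against an embedded sample of the lines shown above; on a live controller, point the grep at /var/lib/avi/log/upgrade-coordinator.log instead.

```shell
# Hypothetical sketch: count cluster-readiness timeout lines in a
# coordinator log. The sample text is embedded for illustration; on a
# controller, grep /var/lib/avi/log/upgrade-coordinator.log directly.
log_sample='[2025-12-08 17:39:52,943] INFO [upgrade_tasks.run:] UC::Waiting for cluster to be ready.::UC
Timeout: waited 1200 sec. and the cluster is still not active.
avi.util.exceptions.TimeoutError: Timeout: waited 1200 sec. and the cluster is still not active.'
echo "$log_sample" | grep -c 'Timeout: waited 1200 sec.'   # prints 2 (matching lines)
```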
When the controller rollback operation completes, the GUI and CLI become available again. The output of the CLI commands "show upgrade status" and "show upgrade status detail filter controller" shows the upgrade aborted with the error "Timeout: waited 1200 sec. and the cluster is still not active."
Example:
show upgrade status
+---------------+--------+---------------+---------------------------+--------------+-----------------------------+---------------------------------+--------+
| Name | Tenant | Cloud | State | Operation | Image | Patch | Reason |
+---------------+--------+---------------+---------------------------+--------------+-----------------------------+---------------------------------+--------+
| example-cluster | admin | - | UPGRADE_FSM_ABORTED | UPGRADE | 22.1.5-9093-20231010.145553 | 22.1.5-9013-2p1-20231130.213739 | - |
show upgrade status detail filter controller
--SNIP--
| state | UPGRADE_FSM_ABORTED |
| last_changed_time | Mon Dec 8 18:06:17 2025 ms(5543494) UTC |
| reason | UC::[Mon Dec 8 17:59:54 2025]Error while running task:WaitUntilClusterReadyLoca |
| | lly |
| | Timeout: waited 1200 sec. and the cluster is still not active. |
| | Traceback (most recent call last): |
---SNIP---
| | avi.util.exceptions.TimeoutError: Timeout: waited 1200 sec. and the cluster is s |
| | till not active. |
| | .::UC |
In the previous file system partition (the one holding the upgraded image), the syslog shows the postgresql service starting after the nginx service, which causes an nginx configuration failure due to an unknown variable.
Recurring nginx error:
2025-12-08T17:34:01.876727+00:00 Avi-Controller nginx-service[637]: nginx: [emerg] unknown "req_id" variable
2025-12-08T17:34:01.876758+00:00 Avi-Controller nginx-service[637]: nginx: configuration file /etc/nginx/nginx.conf test failed
You can use the command 'cat /proc/cmdline' to identify the currently active file system partition. Avi alternates between /host/root1 and /host/root2 across upgrades.
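As a sketch of that check, the snippet below extracts the root= entry from a kernel command line. The cmdline string and device name here are sample values, not taken from an actual controller; on a live node, read the string with 'cat /proc/cmdline'.

```shell
# Hypothetical sketch: pull the root= entry out of a kernel command line to
# see which partition the node booted from. The string below is a sample;
# on a live node use: cmdline=$(cat /proc/cmdline)
cmdline='BOOT_IMAGE=/vmlinuz-generic root=/dev/sda2 ro quiet'
root_dev=$(echo "$cmdline" | tr ' ' '\n' | grep '^root=')
echo "$root_dev"   # root=/dev/sda2
```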
Previous partition log (30.2.4) -- /host/root1/var/log/syslog
Postgresql service
2025-12-08T17:36:10.241535+00:00 Avi-Controller systemd[1]: Starting PostgreSQL...
2025-12-08T17:36:11.051773+00:00 Avi-Controller systemd[1]: Started PostgreSQL.
2025-12-08T17:36:21.327426+00:00 Avi-Controller systemd[1]: Stopping PostgreSQL...
2025-12-08T17:36:28.046616+00:00 Avi-Controller systemd[1]: postgresql.service: Succeeded.
2025-12-08T17:36:28.046871+00:00 Avi-Controller systemd[1]: Stopped PostgreSQL.
2025-12-08T17:36:28.401139+00:00 Avi-Controller systemd[1]: Starting PostgreSQL...
2025-12-08T17:36:29.203858+00:00 Avi-Controller systemd[1]: Started PostgreSQL.
2025-12-08T17:37:48.991466+00:00 Avi-Controller systemd[1]: Stopping PostgreSQL...
2025-12-08T17:37:55.720440+00:00 Avi-Controller systemd[1]: postgresql.service: Succeeded.
2025-12-08T17:37:55.720823+00:00 Avi-Controller systemd[1]: Stopped PostgreSQL.
2025-12-08T17:37:57.632171+00:00 Avi-Controller systemd[1]: Starting PostgreSQL...
2025-12-08T17:37:58.477465+00:00 Avi-Controller systemd[1]: Started PostgreSQL.
2025-12-08T17:39:45.384909+00:00 Avi-Controller systemd[1]: Stopping PostgreSQL...
2025-12-08T17:39:52.124321+00:00 Avi-Controller systemd[1]: postgresql.service: Succeeded.
2025-12-08T17:39:52.124596+00:00 Avi-Controller systemd[1]: Stopped PostgreSQL.
Postgresql logs
/host/root1/var/lib/postgresql/14/pg_log/postgresql-*.log
2025-12-08 17:37:57 UTC LOG: database system was shut down at 2025-12-08 17:37:49 UTC
2025-12-08 17:37:57 UTC LOG: database system is ready to accept connections
2025-12-08 17:39:45 UTC LOG: received fast shutdown request
2025-12-08 17:39:45 UTC LOG: aborting any active transactions
2025-12-08 17:39:45 UTC LOG: background worker "logical replication launcher" (PID 4305) exited with exit code 1
2025-12-08 17:39:45 UTC LOG: shutting down
2025-12-08 17:39:45 UTC LOG: checkpoint starting: shutdown immediate
2025-12-08 17:39:45 UTC LOG: checkpoint complete: wrote 2123 buffers (3.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.160 s, sync=0.011 s, total=0.189 s; sync files=763, longest=0.005 s, average=0.001 s; distance=10884 kB, estimate=10884 kB
2025-12-08 17:39:45 UTC LOG: database system is shut down
Nginx service
2025-12-08T17:34:01.837310+00:00 Avi-Controller systemd[1]: Starting nginx http daemon...
2025-12-08T17:34:01.876727+00:00 Avi-Controller nginx-service[637]: nginx: [emerg] unknown "req_id" variable
2025-12-08T17:34:01.876758+00:00 Avi-Controller nginx-service[637]: nginx: configuration file /etc/nginx/nginx.conf test failed
2025-12-08T17:34:01.891695+00:00 Avi-Controller nginx-service[663]: nginx: [emerg] unknown "req_id" variable
2025-12-08T17:34:01.892602+00:00 Avi-Controller systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
2025-12-08T17:34:01.892693+00:00 Avi-Controller systemd[1]: nginx.service: Failed with result 'exit-code'.
2025-12-08T17:34:01.893135+00:00 Avi-Controller systemd[1]: Failed to start nginx http daemon.
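The ordering problem is visible by comparing timestamps from the two syslog excerpts above: nginx failed its config test at 17:34:01, while PostgreSQL was not started until 17:36:11. A minimal sketch of that comparison, using the timestamps copied from the logs above (on a live node, extract them with grep from /host/root*/var/log/syslog):

```shell
# Hypothetical sketch: compare the nginx failure time against the first
# PostgreSQL start, using timestamps taken from the syslog excerpts above.
nginx_fail="2025-12-08T17:34:01"   # nginx: unknown "req_id" variable
pg_started="2025-12-08T17:36:11"   # systemd: Started PostgreSQL.
# GNU date parses ISO 8601 with -d; %s converts to epoch seconds.
t1=$(date -d "$nginx_fail" +%s)
t2=$(date -d "$pg_started" +%s)
echo $(( t2 - t1 ))   # 130 seconds: nginx failed well before PostgreSQL was up
```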
Affects Version(s):
30.1.x
30.2.x
In 30.x, new variables were added to the nginx configuration for logging. These variables are not built into the nginx service; they depend on a script that runs on the Avi Controller to set up the sites-enabled configuration for the portal.
/etc/nginx/nginx.conf
log_format timed_combined '$remote_addr [cache:$upstream_cache_status] $upstream_addr [$http_x_forwarded_for] - T-ID=$req_id $remote_user [$time_local] '
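To see which variables that log_format line references, you can extract the $-prefixed names from it. The sketch below operates on the line copied from above; on a controller you would grep /etc/nginx/nginx.conf directly. Built-in names such as $remote_addr resolve inside nginx itself, while $req_id must be defined by the sites-enabled configuration that the controller-side script generates.

```shell
# Hypothetical sketch: list the variables referenced by the log_format
# line shown above. On a controller, grep /etc/nginx/nginx.conf instead.
fmt='$remote_addr [cache:$upstream_cache_status] $upstream_addr [$http_x_forwarded_for] - T-ID=$req_id $remote_user [$time_local]'
echo "$fmt" | grep -o '\$[a-z_]*' | sort -u   # includes $req_id among built-ins
```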
The setup_nginx script that configures the dependencies for these variables in the sites-enabled configuration does not execute until the postgresql service is up. In this case, the postgresql service was delayed and did not start within the upgrade's cluster-convergence timeout.
This is a rare condition and is unlikely to occur during upgrades. When it does occur, it usually results in a controller leader failover, which clears any postgresql service issues between the nodes. You may then retry the upgrade.
As a preventative measure you can perform a cluster warm reboot before an upgrade to 30.x.
Steps:
terminal session_timeout 0
terminal unhide
reboot warm
If your upgrade fails again, please open a support request with the VMware Avi Load Balancer support team.