The cluster experiences instability and crashes after a reboot or upgrade due to configuration restore failures.

Article ID: 388596

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

  • The controllers crash continuously, and the stack trace shows the following errors.
  • The stack trace files can be found on the controller at /var/lib/avi/archive/stack_traces/manage.py.20xx_xxx.stack_traces.
Traceback timestamp: 20230920_181346
Command line: /opt/avi/python/bin/portal/manage.py restore_datastore -w
Traceback (most recent call last):
  File "/opt/avi/python/bin/portal/manage.py", line 46, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.8/dist-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/opt/avi/python/bin/portal/nonportal/management/commands/restore_datastore.py", line 60, in handle
    self.restore_config_objs()
  File "/opt/avi/python/lib/avi/infrastructure/db2datastore.py", line 429, in restore_config_objs
    raise Exception('Failed to restore %d %s objects in %d seconds' %
Exception: Failed to restore 5 ServiceEngineGroup objects in 60 seconds
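
To locate the most recent stack trace on a controller, the archive directory mentioned above can be listed by modification time. A minimal sketch; the path is taken from this article, and the head count is arbitrary:

# List the newest stack trace files first; names follow the
# manage.py.<timestamp>.stack_traces pattern shown above.
ls -lt /var/lib/avi/archive/stack_traces/ | head -5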

  • When you check the most recent backup, you may notice that some networks are listed in the 'Network' list but not in the 'NetworkRuntime' list, as shown in the example below.
[root@localhost]# jq -r '.Network[].uuid' scc-mgmt-avi_Default-Scheduler_20230920_170913.json
network-c7486410-cddf-434f-9270-7d377d3b58d7
network-713bc9ad-bffe-42d1-8104-048255b9f784
network-c6f2046b-d5a8-4569-ae94-59cc70ae89b3
network-17544996-1f05-408a-ab26-d777b3f804ac

[root@localhost]# jq -r '.NetworkRuntime[].uuid' scc-mgmt-avi_Default-Scheduler_20230920_170913.json
network-c7486410-cddf-434f-9270-7d377d3b58d7
network-17544996-1f05-408a-ab26-d777b3f804ac
network-c6f2046b-d5a8-4569-ae94-59cc70ae89b3
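
To spot the mismatch without comparing the two lists by eye, the sorted UUID lists can be diffed directly. A minimal sketch reusing the backup file name from the example above; any UUID it prints appears under 'Network' but has no 'NetworkRuntime' entry:

# Print UUIDs present in the 'Network' list but missing from 'NetworkRuntime'
# (process substitution requires bash).
comm -23 \
  <(jq -r '.Network[].uuid' scc-mgmt-avi_Default-Scheduler_20230920_170913.json | sort) \
  <(jq -r '.NetworkRuntime[].uuid' scc-mgmt-avi_Default-Scheduler_20230920_170913.json | sort)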

Cause

The issue is caused by inconsistencies in the network runtime objects in the controller database: some networks have 'Network' entries without corresponding 'NetworkRuntime' entries.

Resolution

Temporary Workaround:

  • Remove Duplicate Network Portgroups:

    • As a workaround, remove the duplicate network portgroups from the network, nwruntime, and vimgrnwruntime sections of the latest controller backup file, referred to here as controller_config_backup.json (see the sketch after these steps).
    • After making these changes, use the updated controller_config_backup.json file to restore the configuration.
  • Cluster Recovery Steps:

    • Power off two of the controller nodes.
    • On the remaining node, run clean_cluster.py with the --skip-se-reboot flag and restore the configuration from the updated backup file.
    • Once the leader node stabilizes, run clean_cluster.py on the remaining nodes and add them back to the cluster.
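
The JSON edit described in the workaround can be scripted rather than done by hand. A minimal sketch, assuming the backup's top-level keys are 'Network', 'NetworkRuntime', and 'VIMgrNWRuntime' (verify the exact key names in your file first) and that the UUID of the offending network is already known; the UUID below is a placeholder taken from the example output, and the command writes a cleaned copy instead of modifying the backup in place:

# Verify the exact top-level key names before editing.
jq 'keys' controller_config_backup.json

# Placeholder UUID of the network entry to strip from all three sections.
UUID=network-713bc9ad-bffe-42d1-8104-048255b9f784

# Remove every object with that UUID; assumes all three sections exist in the file.
jq --arg uuid "$UUID" '
  .Network        |= map(select(.uuid != $uuid)) |
  .NetworkRuntime |= map(select(.uuid != $uuid)) |
  .VIMgrNWRuntime |= map(select(.uuid != $uuid))
' controller_config_backup.json > controller_config_backup.cleaned.json

The cleaned copy can then be used as the updated backup file in the restore step above.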

Permanent Fix:

Upgrade to the following version, where the fix has been applied:

31.1.1