Avi Controller UI/API inaccessible - Postgres Replication Failure

Article ID: 368658

Products

VMware Avi Load Balancer

Issue/Introduction

The Controller UI/API may become inaccessible and return the error "error": "Bad Service". Some of the cluster nodes may be stuck in the "Starting" state.

Run the command below in the CLI to check the status of the nodes.

// SSH to Controller Leader Node
# shell
> show cluster nodes

Also check cluster_manager.INFO on the Controller node for the log messages below to confirm whether the issue is caused by a Postgres replication failure.

// SSH to Controller Leader Node and tail the cluster_manager.INFO logs
# sudo -i
# cd /var/lib/avi/log
# tail -f cluster_manager.INFO

INFO [cluster_node_manager._wait_for_leader_to_join:243] Waiting for leader to join...
INFO [cluster_node_manager._wait_for_leader_to_join:243] Waiting for leader to join...
INFO [cluster_node_manager._wait_for_leader_to_join:243] Waiting for leader to join...
INFO [cluster_node_manager._internal_join:273] Replication file was not written in the window REPLICATION_TIMESTAMP_TIMEOUT, so cannot set replication complete
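
As a rough aid for spotting the messages above without watching the log live, the following sketch counts both signatures in a log file. The function name check_replication_log is hypothetical and not part of the Avi tooling. The matched strings are copied from the log lines shown above, and the log path in the usage comment is the one used in this article.

```shell
#!/usr/bin/env bash
# Sketch only: count the replication-failure signatures in a cluster_manager log.
# check_replication_log is a hypothetical helper, not an Avi command.

check_replication_log() {
    local file="$1"
    local waiting timeout

    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi

    # grep -c prints the number of matching lines; -F treats the pattern as a fixed string.
    waiting=$(grep -cF -- "Waiting for leader to join" "$file")
    timeout=$(grep -cF -- "so cannot set replication complete" "$file")

    echo "waiting_for_leader=$waiting replication_timeout=$timeout"

    # Return 0 only when the replication timeout message is present.
    [ "$timeout" -gt 0 ]
}

# Example (run as root on the Controller Leader Node):
#   check_replication_log /var/lib/avi/log/cluster_manager.INFO
```

A zero exit status only indicates that the timeout message appears somewhere in the file. It does not show whether the condition is still current, so compare the timestamps of the matching lines before drawing conclusions.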

Environment

This issue may occur in all cloud environments.

Cause

Postgres replication may fail for any of the following reasons:

The Postgres database (Config or Metrics) is corrupted
Postgres was not initialized correctly on one of the nodes
The replication-not-complete flag is still set, that is, the file pg_main_replication_not_complete or pg_metrics_replication_not_complete is still present in /var/lib/avi/etc/

Check whether a replication_not_complete file has been written.


// SSH to each Controller node
# sudo -i
# find / -name '*replication_not*' -type f 2>&1 | grep -v find
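
If a full filesystem search is slower than needed, a narrower check can look only for the two flag files named in the Cause section. This is a minimal sketch: the function name list_replication_flags is hypothetical, and the default directory /var/lib/avi/etc is the location given in this article.

```shell
#!/usr/bin/env bash
# Sketch only: report which replication-not-complete flag files exist.
# list_replication_flags is a hypothetical helper, not an Avi command.

list_replication_flags() {
    local dir="${1:-/var/lib/avi/etc}"
    local found=1 name

    for name in pg_main_replication_not_complete pg_metrics_replication_not_complete; do
        if [ -f "$dir/$name" ]; then
            echo "$dir/$name"
            found=0
        fi
    done

    # Exit status 0 means at least one flag file is present.
    return "$found"
}

# Example (run as root on each Controller node):
#   list_replication_flags && echo "replication flag still present"
```

The sketch only reports the files and does not modify or delete them. Any remediation should follow the Resolution section.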

Resolution

Contact Broadcom Support for further assistance on this issue.

https://knowledge.broadcom.com/external/article/405686/how-to-create-a-wolken-case-for-avi-prod.html
