Avi Controller UI/API inaccessible - Postgres Replication Failure

Article ID: 368658

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

The Controller UI/API may become inaccessible and return the error "error": "Bad Service". Some of the cluster nodes may be stuck in the "Starting" state.



Run the following command in the CLI to check the status of the nodes.

// SSH to Controller Leader Node
# shell
> show cluster nodes

Also check cluster_manager.INFO on the Controller node for the log messages below to confirm whether the issue is caused by a Postgres replication failure.

// SSH to Controller Leader Node, then tail cluster_manager.INFO
# sudo -i
# cd /var/lib/avi/log
# tail -f cluster_manager.INFO

INFO [cluster_node_manager._wait_for_leader_to_join:243] Waiting for leader to join...
INFO [cluster_node_manager._wait_for_leader_to_join:243] Waiting for leader to join...
INFO [cluster_node_manager._wait_for_leader_to_join:243] Waiting for leader to join...
INFO [cluster_node_manager._internal_join:273] Replication file was not written in the window REPLICATION_TIMESTAMP_TIMEOUT, so cannot set replication complete
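
To check a saved copy of the log rather than watching it live, the two messages shown above can be counted with grep. This is a minimal sketch, not an official diagnostic. The function name count_replication_errors is hypothetical, and the sketch assumes the log is at the path used above and that the message text matches the sample lines.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that match the replication-failure
# messages quoted in this article. The default path is the one used above.
count_replication_errors() {
    log_file="${1:-/var/lib/avi/log/cluster_manager.INFO}"
    # -c prints the number of matching lines; each -e adds one pattern.
    grep -c \
        -e 'Waiting for leader to join' \
        -e 'Replication file was not written in the window' \
        "$log_file"
}
```

Calling count_replication_errors with no argument reads the default path. A non-zero count is consistent with the symptom described here, but does not by itself confirm the root cause.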

Environment

This issue can occur in any cloud environment.

Cause

Postgres replication may fail for any of the following reasons:

  • The Postgres database (Config or Metrics) is corrupted
  • Postgres was not initialized correctly on one of the nodes
  • A replication-not-complete flag file (pg_main_replication_not_complete or pg_metrics_replication_not_complete) is still present in /var/lib/avi/etc/

Check whether a replication_not_complete file has been written.


// SSH to each Controller node
# sudo -i
# find / -name '*replication_not*' -type f 2>/dev/null
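
As a narrower alternative to searching the whole filesystem, the two flag files named in the Cause list can be tested for directly. The following is a sketch under stated assumptions. The function name list_replication_flags is hypothetical, and the default directory is the /var/lib/avi/etc/ location given above.

```shell
#!/bin/sh
# Hypothetical helper: print the path of each replication flag file that
# exists in the given directory (default: /var/lib/avi/etc).
list_replication_flags() {
    etc_dir="${1:-/var/lib/avi/etc}"
    for name in pg_main_replication_not_complete pg_metrics_replication_not_complete; do
        # -f is true only for an existing regular file.
        if [ -f "$etc_dir/$name" ]; then
            printf '%s\n' "$etc_dir/$name"
        fi
    done
}
```

Run list_replication_flags as root on each Controller node. No output means neither flag file was found in that directory. Any path printed can be included in the support case described under Resolution.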

Resolution

Contact Broadcom Support for further assistance with this issue.

https://knowledge.broadcom.com/external/article/405686/how-to-create-a-wolken-case-for-avi-prod.html