Avi Cluster fails to converge when none of the nodes are able to assert leadership

Article ID: 404553

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

  • An upgrade may fail if Postgres replication delays before the upgrade caused a "replication_not_complete" file to be created on the controller nodes.
  • High availability of the Avi Load Balancer control plane is affected.

Environment

  • Avi deployments on versions 22.1.1 through 22.1.7-2p6, 30.1.x, and 30.2.2 through 30.2.2-2p2.
  • All environments are susceptible to this issue.

Cause

  • Under certain leader failover scenarios, followers may fail to take over leadership.
  • Temporary network delays during database replication can leave "replication_not_complete" files on followers, even after the delay resolves. These files may prevent the followers from assuming leadership if the leader fails.
  • A "replication_not_complete" file created on the old network partition is copied to the new network partition, which causes the upgrade to fail because none of the nodes can become the leader.

Resolution

  • If the upgrade has failed, verify that ALL the nodes are shown as "Active" in the UI.
    • If any node is stuck in "Starting" or "Inactive", please reach out to support.
  • Once all the nodes are confirmed "Active", delete any "replication_not_complete" files present on the controller nodes before the next upgrade attempt.
  • You can use the commands below:
    # sudo -i
    # find / -name '*replication_not*' -type f 2>/dev/null
     
    If any files are found, run the following for each path in the output, one at a time:
     
    # rm <path from the above command>
  • Perform these steps on all the cluster nodes.
    • If no such files are found on any controller node, the upgrade may have failed for a different reason. Please reach out to support.
  • Once the files are cleaned up, re-attempt the upgrade; it should succeed this time.
  • This issue has been addressed in later releases. Please find the details below:
    • Bug ID: AV-218786
    • Details: An upgrade may fail if Postgres replication delays before the upgrade caused a replication_not_complete file to be created on all nodes.
    • Fix Versions: 22.1.7-2p7, 30.2.2-2p3, 30.2.3+, 31.1.1
  • The fix ensures that an upgrade FROM a fixed version will not fail for this reason. However, an upgrade TO one of the versions listed above from an affected version can still fail.
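
The manual find-and-remove steps above can be sketched as a small shell function. This is an illustrative helper, not an official Avi tool: the function name cleanup_replication_markers and its --delete argument are hypothetical. It assumes it is run as root on each controller node, and only after all nodes are confirmed "Active".

```shell
#!/bin/sh
# Hypothetical helper (not part of Avi): list leftover marker files under a
# search root and delete them only when --delete is passed explicitly.
# Usage: cleanup_replication_markers <search_root> [--delete]
cleanup_replication_markers() {
    root="${1:-/}"
    mode="${2:-}"
    # The pattern is quoted so the shell does not expand it before find runs.
    # Errors such as unreadable directories are discarded.
    find "$root" -type f -name '*replication_not*' 2>/dev/null |
    while IFS= read -r path; do
        # Always print the match so it can be reviewed.
        echo "$path"
        if [ "$mode" = "--delete" ]; then
            rm -f -- "$path"
        fi
    done
}
```

A cautious way to use it is to run cleanup_replication_markers / first to review the matches, then run cleanup_replication_markers / --delete once the list looks correct, repeating this on every cluster node.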

Additional Information