AVI Load Balancer service engines going into a partitioned state

Article ID: 405030


Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

AVI load balancer service engines are in a partitioned state.

Symptoms:

  1. Error on the SE from the controller UI:
    • State: Partitioned
    • Reason: Lost connectivity to Service Engine
  2. SE status from the controller shell:
    • Run the command "show serviceengine" and check whether the output shows the following:
      • Oper State: OPER_PARTITIONED
  3. In this scenario, the service engine is reachable and able to attach to the controller, but it remains in the OPER_PARTITIONED state:
      • Log in to the controller leader shell.
      • Run the command below and confirm that the SE attaches to the controller successfully (a consolidated evidence-collection sketch follows this list):
        • attach serviceengine <se_name>
  • Additionally, you may notice the errors below on the SE; these error logs appear only when the controller leader node becomes inactive at the time of the issue.

      • The /var/lib/avi/log/se_supervisor.log may contain the following errors.

        [2025-06-23 03:13:45,693] ERROR [se_supervisor.main:2814] Error in run: Could not get redis IP from cluster services watcher
        [2025-06-23 05:44:59,210] ERROR [se_supervisor.main:2814] Error in run: Could not get redis IP from cluster services watcher

      • The /var/log/syslog may show systemd errors related to the se_supervisor.service.

        Jun 23 05:45:06 Avi-se-#### systemd[1]: se_supervisor.service: Start request repeated too quickly.
        Jun 23 05:45:06 Avi-se-#### systemd[1]: se_supervisor.service: Failed with result 'signal'.
        Jun 23 05:45:06 Avi-se-#### systemd[1]: Failed to start Avi Service Engine Startup script.
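
The sketch below is a minimal, illustrative way to collect the symptoms above in one pass on an affected SE. It assumes shell access to the SE (for example, after running "attach serviceengine <se_name>" from the controller leader shell); the log path and the se_supervisor.service unit name are taken from the messages above, and everything else is an assumption about a standard Linux environment on the SE.

        #!/usr/bin/env bash
        # Evidence-collection sketch for a partitioned SE (run on the SE itself).
        set -euo pipefail

        echo "== se_supervisor service status =="
        # Shows whether the unit failed or systemd is refusing to restart it.
        systemctl status se_supervisor.service --no-pager || true

        echo "== Recent errors in the SE supervisor log =="
        # Look for "Could not get redis IP from cluster services watcher".
        grep -i "error" /var/lib/avi/log/se_supervisor.log | tail -n 20 || true

        echo "== systemd messages for se_supervisor in syslog =="
        grep "se_supervisor.service" /var/log/syslog | tail -n 20 || true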

Environment

  • VMware AVI Load Balancer
    • 30.1.2

Cause

  • A CPU soft lockup is the probable cause of the issue.
    • On the controller leader node, check whether there are any CPU soft lockup messages in syslog (a broader check that also covers rotated logs is sketched after the example below).
    • Example:
      • root@##-##-##-##:/var/log# grep -i "soft lockup" syslog
        Jun 23 03:13:39 ##-##-##-## kernel: [9545082.074865] watchdog: BUG: soft lockup - CPU#4 stuck for 27s! [swapper/4:0]
        Jun 23 03:13:39 ##-##-##-## kernel: [9545082.074894] watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [swapper/1:0]
        Jun 23 03:13:39 ##-##-##-## kernel: [9545082.074899] watchdog: BUG: soft lockup - CPU#2 stuck for 25s! [se_controller_i:897348]
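
The following bash sketch extends the grep above to rotated logs and the kernel ring buffer on the controller leader node. It assumes Ubuntu-style syslog rotation (/var/log/syslog, /var/log/syslog.1, /var/log/syslog.*.gz), which may differ on your deployment.

        #!/usr/bin/env bash
        # Sketch: search for CPU soft lockups on the controller leader node.
        set -euo pipefail

        echo "== Soft lockups in the current and previous syslog =="
        grep -i "soft lockup" /var/log/syslog /var/log/syslog.1 2>/dev/null || true

        echo "== Soft lockups in compressed rotated syslogs =="
        zgrep -i "soft lockup" /var/log/syslog.*.gz 2>/dev/null || true

        echo "== Soft lockups still in the kernel ring buffer =="
        dmesg -T 2>/dev/null | grep -i "soft lockup" || true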

Resolution

  • The issue has been fixed in the following versions:
    • 31.2.1
    • 30.2.5

  • Workaround:
    • Restart the se_supervisor service on the partitioned SEs.
    • Connect to the controller leader node shell.
    • Run the commands below (the first from the controller shell, the second on the SE after attaching; a sketch of the SE-side steps follows this list):
      • attach serviceengine <se_name>
      • sudo systemctl restart se_supervisor.service 
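
The sketch below covers the SE-side part of the workaround. It is illustrative and assumes you have already attached to the partitioned SE with "attach serviceengine <se_name>" from the controller leader shell; the final verification is a controller-shell command and is therefore shown only as a comment.

        #!/usr/bin/env bash
        # Workaround sketch: restart the supervisor on a partitioned SE.
        # Run on the SE after attaching to it from the controller leader shell.
        set -euo pipefail

        # Restart the supervisor service that lost connectivity to the controller.
        sudo systemctl restart se_supervisor.service

        # Confirm the unit is active again.
        systemctl status se_supervisor.service --no-pager

        # Then, from the controller shell, re-run "show serviceengine" and confirm
        # that the SE's Oper State is no longer OPER_PARTITIONED.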

 

Note: If the issue persists, please open an SR with Broadcom Support.