Bringing up a new Service Engine in an existing Service Engine Group fails with the error "unable to connect to controller"

Article ID: 406842


Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

In an Avi environment where both the Controller cluster and existing Service Engines (SEs) are already running on a patch version (e.g., 30.2.1-patch or 31.1.1-patch), deploying a new Service Engine into the same SE Group may fail during the initialization process.

During bootstrap, the new SE attempts to connect to the Controller to download the required patch image. However, the deployment stops with the error:

“unable to connect to controller”

This prevents the new SE from completing its upgrade from the base image to the patch version and results in the SE failing to join the SE Group.

Environment

Avi Load Balancer
Affected Versions 30.x: 30.2.1-2p1, 30.2.1-2p2, 30.2.1-2p3
Affected Versions 31.x: 31.1.1-2p1, 31.1.1-2p2

Cause

When a new Service Engine is deployed into an existing SE Group running on a patch version, it initially boots with the base image and then attempts to download and upgrade to the patch image.

If the Controller nodes are configured using FQDNs, the new SE may fail to correctly resolve these names to the expected internal hostnames (e.g., node1, node2, node3).
Because of this resolution failure, the SE cannot connect to the correct controller endpoint to retrieve the patch files.

This results in:

Inability to copy patch packages from the controller
Failure to upgrade to the patch version
SE deployment failing with “unable to connect to controller”
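
To spot-check the name resolution described above from the affected SE, the sketch below (a standalone Python snippet, not an Avi tool) resolves each Controller address and prints the result. The FQDNs are placeholders; replace them with the Controller node names configured in your cluster. Note that this only verifies basic address resolution, not the SE's internal mapping to node1/node2/node3.

import socket

# Placeholder Controller FQDNs - replace with the node names/addresses
# actually configured in the Controller cluster.
controller_fqdns = [
    "controller1.example.com",
    "controller2.example.com",
    "controller3.example.com",
]

for fqdn in controller_fqdns:
    try:
        # getaddrinfo follows the same resolver path (hosts file / DNS)
        # that the SE uses during bootstrap.
        results = socket.getaddrinfo(fqdn, None)
        addresses = sorted({r[4][0] for r in results})
        print(f"{fqdn} -> {', '.join(addresses)}")
    except socket.gaierror as err:
        print(f"{fqdn} -> resolution FAILED: {err}")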

Steps to verify this issue:

Controller Node Name Configuration:

>> Check if the controller nodes are configured using hostnames or FQDNs.
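
As one way to perform this check, the sketch below queries the cluster configuration through the Controller REST API. This assumes the /api/cluster endpoint and basic authentication are available on your Controller version (field names can vary slightly between releases); the Controller address and credentials are placeholders.

import requests

# Placeholders - replace with the Controller VIP/address and valid credentials.
CONTROLLER = "https://controller.example.com"
AUTH = ("admin", "password")

# Assumption: GET /api/cluster returns the cluster node list on this version.
resp = requests.get(f"{CONTROLLER}/api/cluster", auth=AUTH, verify=False)
resp.raise_for_status()

for node in resp.json().get("nodes", []):
    # Show each node's name and configured address to see whether nodes
    # are referenced by IP address, hostname, or FQDN.
    print(node.get("name"), "->", node.get("ip"))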

SCP logs on the Service Engine:

Log location (on the Service Engine): /var/lib/avi/log/aviscp.INFO

In the SE logs, if the RemoteAddress is shown as “localhost” instead of node1/node2/node3, the SE is attempting to fetch patch files from an invalid source.

This indicates that hostname/FQDN resolution is not occurring as expected.

 AviSCP Client Command Line Args::
         + User                  : aviseuser
         + RemoteAddress         : localhost --> it should be node1/node2/node3
         + gRPC Port             : 5443
         
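As a quick way to scan for this condition, the sketch below (not part of the product) reads the aviscp.INFO log on the SE and flags any RemoteAddress entry that is set to localhost; it assumes the log format shown above.

import re

LOG_PATH = "/var/lib/avi/log/aviscp.INFO"  # path referenced in this article

with open(LOG_PATH) as log:
    for line in log:
        # Match lines such as "+ RemoteAddress : localhost"
        match = re.search(r"RemoteAddress\s*:\s*(\S+)", line)
        if match:
            address = match.group(1)
            note = "  <-- unexpected, should be node1/node2/node3" if address == "localhost" else ""
            print(f"RemoteAddress = {address}{note}")
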

>> Verify that the controller address passed to the SE is not set to “localhost”.

The following sample logs were collected from version 30.2.1:

 INFO [upgrade_se.print_info:115] SEUC:: Upgrade SE - Begin: se_args passed = Namespace(buffer_size=128, controller='localhost', fips_mode=False, image_path='/host/pkgs/30.2.1-9105-20240506.164825/se.pkg', patch=None, patch_image_path='/host/pkgs/30.2.1-9002-2p3-20240813.141105/se_patch.pkg', patch_version='2p3', reboot=True, retry=3, source=None, tag='30.2.1-9105-20240506.164825', timeout=3600)::SEUC

>> The Service Engine's attempts to download the package fail:

 ERROR [upgrade_se.print_error:123] ^[[31mSEUC:: [upgrade_se.py] --- UPGRADE SE SCRIPT ERROR  Error: /host/pkgs/30.2.1-9105-20240506.164825/se.pkg scp_image_from_controller failed after 3 retries --- ::SEUC^[[0m 
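
To confirm which controller address the bootstrap script actually received, the sketch below extracts the controller= value from the "Upgrade SE - Begin" line shown above. The log path is illustrative only (the exact file containing this SE bootstrap output can vary by version); point it at whichever log holds these lines.

import re

# Illustrative path - replace with the SE log file that contains the
# "SEUC:: Upgrade SE - Begin" line.
LOG_PATH = "/var/lib/avi/log/upgrade_se.log"

with open(LOG_PATH) as log:
    for line in log:
        if "Upgrade SE - Begin" in line:
            match = re.search(r"controller='([^']*)'", line)
            if match:
                controller = match.group(1)
                note = "  <-- localhost indicates this issue" if controller == "localhost" else ""
                print(f"controller passed to SE = {controller}{note}")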

 

Resolution

No workaround is available for this issue.

Upgrade to version 30.2.4, 31.2.1, 31.1.2, or 31.1.1-2p3 before deploying new Service Engines.