SE stuck in PARTITIONED state due to gRPC/Envoy connection issue.

Article ID: 405354

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

 

  • The Service Engine (SE) deregistered due to a heartbeat (HB) failure, potentially caused by a network disruption or another reason. After deregistration, the SE failed to re-register with the Controller and remains in a PARTITIONED state.

 

  • In the /var/lib/avi/log/se_mgr.INFO log, events similar to the following can be found:
I0615 12:36:13.893563 995469 sm_svc_obj.cc:587] F[SaveState] [mb_rd_dmz-se-jtezg:se-d10ba9] SeEventHistory
:
ev: "2025-06-15 12:32:53.153764 UPD_CONSUMERS      7 7"
ev: "2025-06-15 12:36:13.887281 DEREGISTER         7 SE_DEREG_UNREACHABLE" 

 

  • Additionally, the SE fails to establish a gRPC connection with the Controller nodes due to Envoy-related connection termination errors. These errors can be found in the SE's /var/lib/avi/log/se_supervisor.log:
[2025-06-15 12:35:26,807] ERROR [se_cluster_services_client._watch:122] CLUSTER WATCHER: Subscribe with node3.controller.local failed with error <SE gRPC authentication failed:
upstream connect error or disconnect/reset before headers. reset reason: connection termination>.
[2025-06-15 12:35:46,526] ERROR [se_cluster_services_client._watch:118] CLUSTER WATCHER: Subscribe with node1.controller.local failed with gRPC error <<_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "upstream connect error or disconnect/reset before headers. reset reason: connection termination"
        debug_error_string = "UNKNOWN:Error received from peer ipv4 {created_time:"2025-06-15T12:35:46.526202452+00:00", grpc_status:14, grpc_message:"upstream connect error or disconnect/reset before headers. reset reason: connection termination"}"

Cause

Envoy terminates the gRPC stream between the SE and the remote Controller, most likely because the Controller is deadlocked or unresponsive due to an internal issue. After the initial failure, Envoy on the SE stops servicing new connection attempts, so the SE cannot re-establish the gRPC session on its own and remains partitioned.
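To distinguish a stuck Envoy from one that has crashed outright, it can help to confirm the service is still reported as active while the gRPC errors persist. A minimal check, using the same envoy.service unit name referenced in the Resolution below:

# Check whether the Envoy unit is still reported as active on the SE
systemctl is-active envoy.service

# Show recent unit status and journal lines for the Envoy service
systemctl status envoy.service --no-pager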

Resolution

Workaround: Restarting the Envoy service on the affected SE resolves the issue by re-establishing the gRPC connection with the Controller.

Run the following command on the Service Engine (SE) to restart the Envoy service:

systemctl restart envoy.service
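
After the restart, the gRPC connection to the Controller should recover and the SE should leave the PARTITIONED state. A quick verification sketch, assuming the same log path used in the Issue section above:

# Confirm Envoy came back up cleanly
systemctl is-active envoy.service

# Watch the supervisor log; the CLUSTER WATCHER subscribe errors should stop recurring
tail -f /var/lib/avi/log/se_supervisor.log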