Security Manager process on an Avi Controller node crashes intermittently

Article ID: 394923

Products

VMware Avi Load Balancer

Issue/Introduction

  • When the security_mgr process crashes, the Controller generates a CONTROLLER_SERVICE_FAILURE event, as in the example below.
    Email alert:
    Summary: At 2025-02-25 16:12:19+00:00 (UTC) alert #controller# occurred on cluster-aef5d4e8-b582-4724-bf33-ceebe1b02b31 in tenant admin 
    Details: At 2025-02-25 16:12:19+00:00 (UTC) event CONTROLLER_SERVICE_FAILURE occurred on cluster-aef5d4e8-b582-4724-bf33-ceebe1b02b31 in tenant admin as Controller service security_mgr failed 
    node_name: node3.controller.local 
    service_name: security_mgr
  • The security_mgr process logs show the following trace:
    runtime: goroutine stack exceeds 1000000000-byte limit
    runtime: sp=0xc28c4ee380 stack=[0xc28c4ee000, 0xc2ac4ee000]
    fatal error: stack overflow
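
When triaging, the signature above can be searched for in a saved copy of the security_mgr log. The following is a minimal sketch, not an Avi tool: the function name `find_stack_overflows` and the choice to pass the log as a list of lines are assumptions for illustration, and the sample lines are copied from the trace above.

```python
import re

# Go runtime message printed when a goroutine stack reaches its size limit.
LIMIT_RE = re.compile(r"runtime: goroutine stack exceeds (\d+)-byte limit")


def find_stack_overflows(lines):
    """Return (line_number, limit_bytes) for each stack-limit message that is
    followed by 'fatal error: stack overflow' within the next three lines."""
    hits = []
    for i, line in enumerate(lines):
        match = LIMIT_RE.search(line)
        if not match:
            continue
        # In the trace above, the fatal error line appears two lines later.
        window = lines[i + 1 : i + 4]
        if any("fatal error: stack overflow" in w for w in window):
            hits.append((i + 1, int(match.group(1))))
    return hits


sample = [
    "runtime: goroutine stack exceeds 1000000000-byte limit",
    "runtime: sp=0xc28c4ee380 stack=[0xc28c4ee000, 0xc2ac4ee000]",
    "fatal error: stack overflow",
]
print(find_stack_overflows(sample))
```

A non-empty result indicates that the crash matches the stack overflow symptom described in this article rather than some other failure of the service.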

Environment

Affected versions:

  • 31.1.1
  • <= 30.2.2
  • 22.1.x

Cause

  • The security manager service may crash when it starts up again after a cluster change in which quorum is lost and the Controller leader changes:
    [2025-01-25 04:07:45,318] ERROR [cluster_quorum_manager.evaluate_membership:393] [QUORUM] [QUORUM_LOSS] Loss of Quorum
    [2025-01-25 04:09:49,105] INFO [cluster_quorum_manager.evaluate_membership:382] [QUORUM] [LEADER_CHANGE] Leader None has changed to node1.controller.local
    [2025-01-25 04:09:49,106] INFO [cluster_quorum_manager.evaluate_membership:398] [QUORUM] [MEMBERSHIP_CHANGE] Active nodes ['node3.controller.local', 'node1.controller.local', 'node2.controller.local'] Leader node1.controller.local
    panic: runtime error: index out of range [-1]
    goroutine 702 [running]:
    avi/utils.ring.FindResource(0x7eb36e8, 0x0, 0x0, 0xc001af8c00, 0x33, 0xc002a265c0, 0x0)
    Sat Jan 25 04:10:23 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======
  • A series of stack overflow crashes may then follow at varying intervals:
    Mon Jan 27 06:30:55 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======
    Wed Jan 29 07:31:48 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======
    Fri Jan 31 07:49:19 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======
    Mon Feb  3 13:45:29 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======
    Sun Feb  9 02:20:04 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======
    Fri Feb 14 14:45:32 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======
  • These crashes occur because the security manager's Go client, which uses the Avi SDK to create a client session, fails to log in and receives a 502 response. The portal-access logs below confirm this:
    0220 06:00:29.665989    I  569696       session/avisession.go:746       Req for POST uri https://localhost/login tenant  RespCode 502
    0220 06:00:29.666373    I  569696       session/avisession.go:763       Retrying url: https://localhost/login; retry: 0 due to Status Code 502
  • At this point the session creation API is stuck and the SDK is in an infinite retry loop. The logs show retry count = 0, so the retries appear to fall outside the scope of maxAPIRetries.
  • This is a known issue in Avi 22.1.x.
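
To see how often the service is restarting, the gaps between the PROCESS CLEANUP START lines listed above can be computed from a saved log. This is a sketch under assumptions: the helper names `crash_times` and `crash_gaps` are hypothetical, the timestamp layout is inferred only from the lines quoted in this article, and day and month names are assumed to be in English.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted lines, e.g. "Mon Feb  3 13:45:29 2025".
CLEANUP_RE = re.compile(
    r"^(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}) =+ PROCESS CLEANUP START: "
    r"Module security_mgr\b"
)


def crash_times(lines):
    """Extract the timestamp of every security_mgr cleanup line."""
    times = []
    for line in lines:
        match = CLEANUP_RE.match(line.strip())
        if match:
            # Collapse the double space used for single-digit days.
            stamp = " ".join(match.group(1).split())
            times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times


def crash_gaps(lines):
    """Return the time between consecutive security_mgr cleanups."""
    times = crash_times(lines)
    return [later - earlier for earlier, later in zip(times, times[1:])]


sample = [
    "Mon Jan 27 06:30:55 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======",
    "Wed Jan 29 07:31:48 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======",
    "Fri Jan 31 07:49:19 2025 ====== PROCESS CLEANUP START: Module security_mgr uuid 005056828480-security_mgr:0 ======",
]
for gap in crash_gaps(sample):
    print(gap)
```

For the three sample lines, the gaps are a little over two days each, which matches the irregular but recurring pattern described above.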

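The retry behavior described in this section can be illustrated with a small sketch. This is not the Avi SDK code (the SDK client is written in Go and its internals are not shown in this article); it is a Python analogy under stated assumptions, and `login_with_retries`, `do_login`, `max_api_retries`, and `LoginError` are hypothetical names. It shows a retry loop whose counter advances on every attempt, so a persistent 502 ends in a bounded failure instead of retrying indefinitely.

```python
import time


class LoginError(Exception):
    """Raised when login still returns 502 after the retry budget is used up."""


def login_with_retries(do_login, max_api_retries=3, base_delay=0.0):
    """Call do_login() until it returns a non-502 status or retries run out.

    do_login is a hypothetical callable that returns an HTTP status code.
    Returns (status, retries_used).
    """
    for retry in range(max_api_retries + 1):
        status = do_login()
        if status != 502:
            return status, retry
        # The counter advances on every failed attempt, so the number of
        # calls is capped at max_api_retries + 1 even if 502 never clears.
        if retry < max_api_retries:
            time.sleep(base_delay * (2 ** retry))
    raise LoginError(f"login returned 502 after {max_api_retries} retries")


# Simulated portal that answers 502 twice and then 200.
responses = iter([502, 502, 200])
print(login_with_retries(lambda: next(responses)))
```

The loop is iterative rather than recursive, so no call stack grows while the portal keeps returning 502. By contrast, the logs in this article show the retry value staying at 0, which is consistent with a retry path that never reaches its limit.
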
Resolution