Avi default or custom alerts suddenly stop working and notifications stop

Article ID: 403510


Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

All alert configurations, default or custom, stop working, which affects any configured notifications such as syslog or email.

Under Operations > Alerts, the All Alerts page will be empty.
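
The same check can be made from the controller CLI. As a hedged example (the exact command set may vary by release), listing the alert objects should return no recent entries while the issue is present:

[admin:]: > show alert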

In the alert notification logs from the leader controller node, you will find that the events sent from the event manager for alerts stop suddenly; the last log entries will have an old timestamp. In this case, the events stopped on 2025-04-14, while the logs were collected on 2025-06-24.

File: /var/lib/avi/log/alert_notifications_debug*log*

/var/lib/avi/log]
└─$ zgrep 'alert_evns_mgr._subscribe' alert_notifications_debug.log | tail -n 4
[2025-04-14 23:28:30,532] DEBUG [alert_evns_mgr._subscribe:84] Received: [report_timestamp: 7493314800064490512
[2025-04-14 23:28:38,595] DEBUG [alert_evns_mgr._subscribe:84] Received: [report_timestamp: 7493314834424271156
[2025-04-14 23:28:48,781] DEBUG [alert_evns_mgr._subscribe:84] Received: [report_timestamp: 7493314885964127964
[2025-04-14 23:28:52,611] DEBUG [alert_evns_mgr._subscribe:84] Received: [report_timestamp: 7493314898849120102
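
As a quick sanity check (a simple sketch that reuses the grep pattern above), compare the timestamp of the last event received by the alert manager with the current controller time; a gap of days or weeks indicates the event stream has stalled:

/var/lib/avi/log]
└─$ zgrep 'alert_evns_mgr._subscribe' alert_notifications_debug.log | tail -n 1 && date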

You can also correlate these timestamps with the last alerts created for any alert configuration in the system.

File: /var/lib/avi/log/alert_notifications_debug*log*

/var/lib/avi/log]
└─$ zgrep -ih 'save' alert_notifications_debug*log* | grep -v "nosavealert" |sort  | grep 'alert_manager' | tail -n 6
[2025-04-14 23:20:04,833] INFO [alert_manager.saveAlertToDb:963] Saved Alert to DB: System-Controller-Alert-00505698a202-1744672802.911488-1744672802-27791154 is created 1 obj alert-31a98435-da66-4739-a1f7-84dc11f2340d
[2025-04-14 23:20:04,836] INFO [alert_manager.raiseAlertTask:1069] saved alert System-Controller-Alert-00505698a202-1744672802.911488-1744672802-27791154 with uuid alert-31a98435-da66-4739-a1f7-84dc11f2340d
[2025-04-14 23:20:04,846] INFO [alert_manager.saveAlertToDb:963] Saved Alert to DB: Custom-Controller Alert-00505698a202-1744672802.911488-1744672802-13630715 is created 1 obj alert-005a2a4f-fbea-4968-9120-9b631cb3c0a0
[2025-04-14 23:20:04,849] INFO [alert_manager.raiseAlertTask:1069] saved alert Custom-Controller Alert-00505698a202-1744672802.911488-1744672802-13630715 with uuid alert-005a2a4f-fbea-4968-9120-9b631cb3c0a0
[2025-04-14 23:20:04,876] INFO [alert_manager.saveAlertToDb:963] Saved Alert to DB: System-Controller-Alert-00505698a202-1744672802.912353-1744672802-40134650 is created 1 obj alert-9ffc2268-1672-46fe-94d3-10660c89a6dd
[2025-04-14 23:20:04,877] INFO [alert_manager.raiseAlertTask:1069] saved alert System-Controller-Alert-00505698a202-1744672802.912353-1744672802-40134650 with uuid alert-9ffc2268-1672-46fe-94d3-10660c89a6dd

In the event manager logs (follower controller node), you will find a large number of events with the error "delayed by more than 128 seconds." These timestamps can be correlated with the log messages from the alert notifications and will fall within the same timeframe.

File: /var/lib/avi/log/event_manager*INFO*

/var/lib/avi/log]
└─$ zgrep 'event_manager_streamer' event_manager.INFO  | grep 'Event with' | tail -n 4
2025-04-14T23:27:12.069Z	E  5095  	eventmanager/event_manager_streamer.go:219	Event with ReportTimestamp 7493314258898989891, event_id CONTROLLER_SERVICE_FAILURE and obj type CLUSTER delayed by more than 128 seconds.
2025-04-14T23:27:12.069Z	E  5095  	eventmanager/event_manager_streamer.go:219	Event with ReportTimestamp 7493314215948661792, event_id CONTROLLER_SERVICE_FAILURE and obj type CLUSTER delayed by more than 128 seconds.
2025-04-14T23:27:12.170Z	E  5130  	eventmanager/event_manager_streamer.go:219	Event with ReportTimestamp 7493314344797716345, event_id CONTROLLER_SERVICE_FAILURE and obj type CLUSTER delayed by more than 128 seconds.
2025-04-14T23:27:12.170Z	E  5130  	eventmanager/event_manager_streamer.go:219	Event with ReportTimestamp 7493314387747578901, event_id CONTROLLER_SERVICE_FAILURE and obj type CLUSTER delayed by more than 128 seconds.
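
To gauge how widespread the delay is (a small sketch based on the error string above), count the occurrences across the event manager logs on each follower controller node:

/var/lib/avi/log]
└─$ zgrep -c 'delayed by more than 128 seconds' event_manager*INFO*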


Environment

Affects Version(s):

30.1.x, 30.2.1–30.2.3, 31.1.1

Cause

It has been identified that the event manager can get into a deadlock and stop streaming events to the alert manager. As a result, the alert manager stops raising alerts and does not recover on its own.

Resolution

Please upgrade the system to one of the fix versions listed below.

Bug ID: AV-242168

Fix Version: 30.2.4, 31.1.2, 31.2.1

Workaround(s): 

Disable the knob "alert_manager_use_evms" in controller_properties so that the alert manager uses the Log Manager instead of the Event Manager, then restart the avipythoncontroller service (systemctl restart avipythoncontroller) on the controller leader node, as detailed in the steps below.

SSH to the controller leader node as the admin user.

  1. Change the knob "alert_manager_use_evms" in controller_properties:

    [admin:]: > show controller properties | grep alert
    | alert_manager_use_evms                     | True              |
    
    [admin:]: > configure controller properties
    [admin:]: controllerproperties> no alert_manager_use_evms
    [admin:]: controllerproperties> save
    
    [admin:]: > show controller properties | grep alert
    | alert_manager_use_evms                     | False              | 
  2. Execute the following command to restart the service: 
    sudo systemctl restart avipythoncontroller.service
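
After the restart, you can verify that alerts are being raised again (a hedged check that reuses the log patterns shown earlier); new "Saved Alert to DB" entries with current timestamps should appear once a matching event occurs:

/var/lib/avi/log]
└─$ zgrep -ih 'save' alert_notifications_debug*log* | grep 'alert_manager.saveAlertToDb' | sort | tail -n 3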

Note: 

  • With Log Manager, event-based alerts may experience a slight delay of around 5–10 seconds under normal conditions. In rare scenarios, such as during cluster leader transitions, controller node reboots, or alert_manager service restarts, this delay could extend up to 90 seconds. In such instances, a few alerts may not be delivered.
  • This behavior with the Log Manager is expected and does not impact the system’s ability to process or log events; it only affects the immediacy of alert notifications.
  • The knob setting will persist; a node restart will not revert it to the default.
  • The workaround functions correctly only for alert configurations without AND conditions; otherwise, it may lead to alert_manager restarts.