Alerts not generated for certain events post controller reboot or cluster failover
Article ID: 406652
Products
VMware Avi Load Balancer
Issue/Introduction
When the leader node fails and a new leader node comes up, SE_UP and CONTROLLER_NODE_JOINED events are generated, but alert_manager fails to trigger alerts for these events.
Environment
All Avi deployments are susceptible to this issue.
Cause
Leader Failover:
When the leader node goes down, a new leader node is elected and starts its services.
Alert Manager Starts Querying:
The alert_manager component on the new leader node begins its job. It queries log_mgr every 5 seconds to check for new events within a 5-second time window.
Log Manager Slow Initialization:
The log_mgr is slow to start up and become fully operational. It takes more than 5 minutes to initialize (index all log files), whereas the alert_manager expects it to be ready much sooner (within about 30 seconds).
A second problem with a log_mgr restart is that, on every restart, the first cleanup run is assigned only 4 GB of disk quota. Only afterwards is the quota updated to the designated value based on the controller's resources.
This incorrect initial quota triggers a substantial cleanup in scaled environments.
When the log sync service is restarted, it tries to fetch the removed files from the other nodes, since they still have them. Because the volume of data is substantial, the sync takes a long time. While it is in progress, log_mgr may respond incorrectly when alert_manager queries it for certain files, and alert_manager eventually moves on.
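The effect of the temporary quota can be illustrated with simple arithmetic. The sketch below is not Avi code: the function name, the 60 GiB of logs on disk, and the 100 GiB designated quota are hypothetical figures chosen for illustration, and the 4 GB quota from the article is treated as 4 GiB.

```python
GIB = 1024 ** 3

def bytes_to_purge(used_bytes: int, quota_bytes: int) -> int:
    """Amount a cleanup run must delete to bring usage under the quota."""
    return max(0, used_bytes - quota_bytes)

# Hypothetical scaled environment: 60 GiB of logs, 100 GiB designated quota.
used = 60 * GIB
initial_quota = 4 * GIB        # quota applied to the first cleanup after restart
designated_quota = 100 * GIB   # quota applied once it is updated (assumed value)

purged_first_run = bytes_to_purge(used, initial_quota)
purged_normal_run = bytes_to_purge(used, designated_quota)

print(purged_first_run // GIB, "GiB removed under the initial quota")
print(purged_normal_run // GIB, "GiB removed under the designated quota")
```

Under these assumed numbers, the first cleanup removes most of the logs even though usage is well within the designated quota, and that data then has to be transferred back from the other nodes.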
Events Missed:
The actual SE_UP and CONTROLLER_NODE_JOINED events occur shortly after the new leader starts.
By the time the log_mgr finally finishes its lengthy initialization (after 5+ minutes) and is ready to provide event logs, the alert_manager's query window has repeatedly reset and moved forward in time.
It's no longer looking at the time slot when those critical initial events happened, so the alerts for these events are missed.
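The timing problem described above can be reproduced with a small simulation. This is a simplified model, not the actual alert_manager logic: the `Event` class, the `simulate` function, and the event timestamps (20 and 45 seconds after the new leader starts) are assumptions for illustration. The 5-second poll interval, the 5-second window, and the 5-minute log_mgr initialization come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    timestamp: int  # seconds after the new leader started

def simulate(events, log_ready_at, poll_interval=5, window=5, duration=600):
    """Return the names of events a sliding-window poller would alert on.

    Every poll_interval seconds the poller asks for events in the window
    (now - window, now]. Before log_ready_at, the log store is still
    indexing and the query returns nothing.
    """
    alerted = []
    for now in range(poll_interval, duration + 1, poll_interval):
        if now < log_ready_at:
            continue  # log store not ready: this window is skipped for good
        window_start = now - window
        alerted += [e.name for e in events if window_start < e.timestamp <= now]
    return alerted

# Hypothetical event times shortly after failover.
events = [Event("CONTROLLER_NODE_JOINED", 20), Event("SE_UP", 45)]

slow_log_store = simulate(events, log_ready_at=300)  # ready after 5 minutes
fast_log_store = simulate(events, log_ready_at=0)    # ready immediately

print("alerts with slow log store:", slow_log_store)
print("alerts with fast log store:", fast_log_store)
```

In this model the poller never revisits a window it has already passed, so any event that falls before the log store becomes ready produces no alert.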