Health checks are still sent from the standby SE after the "health monitoring from standby SE" knob is disabled in the SE-Group

Products

VMware Avi Load Balancer

Issue/Introduction

Environment

Active/Standby SE-Group with the "health monitoring from standby SE" knob disabled.

Cause

After a primary-to-standby SE switchover, the new standby SE (the old primary) continues to send health checks.

To verify whether health checks are being sent from the standby SE, check the output of the following command:

[admin:<cntlr>]: > show pool <pool-name> server hmonstat filter disable_aggregate se

If health checks are being sent from the standby SE, the output of the above command shows an active health monitor for that SE:

[admin:admin]: > show pool <pool-name> server hmonstat filter disable_aggregate se
+---------------------------------+------------------------------------------------------+
| Field                           | Value                                                |
+---------------------------------+------------------------------------------------------+
| last_transition_timestamp_3     | Mon Jun 10 14:38:51 2024 ms(135035) UTC              |
| last_transition_timestamp_2     | Mon Jun 10 14:38:51 2024 ms(126084) UTC              |
| last_transition_timestamp_1     | Sat Jun  8 05:18:28 2024 ms(968542) UTC              |
| server_hm_stat[1]               |                                                      |
|   server_name                   | <server-IP>:80                                       |
|   oper_status                   |                                                      |
|     state                       | OPER_UP                                              |
|   last_transition_timestamp_3   | Mon Jun 10 14:38:51 2024 ms(135025) UTC              |
|   last_transition_timestamp_2   | Mon Jun 10 14:38:51 2024 ms(126075) UTC              |
|   last_transition_timestamp_1   | Sat Jun  8 05:18:28 2024 ms(968555) UTC              |
|   shm_runtime[1]                |                                                      |
|     health_monitor_name         | System-TCP                                           |
|     health_monitor_type         | HEALTH_MONITOR_TCP                                   |
|     last_transition_timestamp_3 | Mon Jun 10 14:38:51 2024 ms(134990) UTC              |
|     state                       | 1                                                    |
|     rise_count                  | 35                                                   |
|     fall_count                  | 0                                                    |
|     total_checks                | 35                                                   |
|     total_failed_checks         | 0                                                    |
|     hm_initial                  | 0                                                    |
|     avg_response_time           | 1                                                    |
|     recent_response_time        | 1                                                    |
|     min_response_time           | 1                                                    |
|     max_response_time           | 2                                                    |
|     port                        | 80                                                   |
|     curr_failed_checks          | 0                                                    |
|   ip_addr                       | <server-IP>                                          |
|   port                          | 80                                                   |
| se_uuid                         | Avi-se-rdwly:se-04af9001-a5c7-49d6-8dfb-51aa6ebb2ff3 |
+---------------------------------+------------------------------------------------------+
+---------------------------------+------------------------------------------------------+
| Field                           | Value                                                |
+---------------------------------+------------------------------------------------------+
| last_transition_timestamp_3     | Mon Jun 10 14:39:09 2024 ms(68358) UTC               |
| last_transition_timestamp_2     | Sat Jun  8 05:17:01 2024 ms(623239) UTC              |
| last_transition_timestamp_1     | Sat Jun  8 05:17:01 2024 ms(613044) UTC              |
| server_hm_stat[1]               |                                                      |
|   server_name                   | <server-IP>:80                                       |
|   oper_status                   |                                                      |
|     state                       | OPER_DOWN                                            |
|     reason[1]                   | Marked down by System-TCP [Connection timed out]     |
|   last_transition_timestamp_3   | Mon Jun 10 14:39:09 2024 ms(68316) UTC               |
|   last_transition_timestamp_2   | Sat Jun  8 05:17:01 2024 ms(623199) UTC              |
|   last_transition_timestamp_1   | Sat Jun  8 05:17:01 2024 ms(613042) UTC              |
|   shm_runtime[1]                |                                                      |
|     health_monitor_name         | System-TCP                                           | <-- Active health monitor on the standby SE 
|     health_monitor_type         | HEALTH_MONITOR_TCP                                   |
|     last_transition_timestamp_3 | Mon Jun 10 14:39:09 2024 ms(68007) UTC               |
|     last_transition_timestamp_2 | Sat Jun  8 05:17:01 2024 ms(623129) UTC              |
|     state                       | 0                                                    |
|     rise_count                  | 0                                                    |
|     fall_count                  | 31                                                   |
|     total_checks                | 20694                                                |
|     total_failed_checks         | 32                                                   |
|     total_count[1]              |                                                      |
|       type                      | CONNECTION_TIMEOUT                                   |
|       count                     | 32                                                   |
|     curr_count[1]               |                                                      |
|       type                      | CONNECTION_TIMEOUT                                   |
|       count                     | 16                                                   |
|     hm_initial                  | 0                                                    |
|     avg_response_time           | 397                                                  |
|     recent_response_time        | 3999                                                 |
|     min_response_time           | 3999                                                 |
|     max_response_time           | 4001                                                 |
|     port                        | 80                                                   |
|     curr_failed_checks          | 32                                                   |
|   ip_addr                       | <server-IP>                                          |
|   port                          | 80                                                   |
| se_uuid                         | Avi-se-avuvg:se-bbdba4b8-3f94-4ea1-8714-8f036216b17c |
+---------------------------------+------------------------------------------------------+
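For pools with many servers, the output above can be long to read by eye. The following is a minimal Python sketch, not an official tool, that parses saved output of this command and groups the health monitor counters by the SE that reported them. It assumes the table layout matches the sample above (one table per SE, each ending with an se_uuid row); the function name parse_hmonstat and the returned dictionary keys are hypothetical.

```python
# Counters copied from the shm_runtime section of the sample output.
COUNTERS = ("total_checks", "total_failed_checks")


def parse_hmonstat(text):
    """Group health-monitor counters by the SE that reported them.

    Assumes one table per SE, with the se_uuid row closing each table,
    as in the sample output of this article.
    """
    per_se = []
    monitors = []
    server = state = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip prompt lines and +---+ borders
        cells = line.split("|")
        if len(cells) < 4:
            continue
        # cells[1] is the Field column, cells[2] is the Value column;
        # anything after the last "|" (e.g. an annotation) is ignored.
        key, value = cells[1].strip(), cells[2].strip()
        if key == "server_name":
            server, state = value, None
        elif key == "state" and value.startswith("OPER_"):
            state = value  # server oper_status, not the monitor's 0/1 state
        elif key == "health_monitor_name":
            monitors.append({"server": server, "oper_state": state, "name": value})
        elif key in COUNTERS and monitors:
            monitors[-1][key] = int(value)
        elif key == "se_uuid":
            per_se.append({"se_uuid": value, "monitors": monitors})
            monitors, server, state = [], None, None
    return per_se
```

Each entry in the returned list carries one se_uuid and the monitors seen in that SE's table. Compare the se_uuid values against the SE that is currently standby; a monitor entry for that SE indicates the condition described here.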

Note: Health checks sent from the standby SE have no functional impact, even though they were not supposed to be running in the first place. However, they do affect the health score calculation for the Pool/Virtual Service.

Resolution

This bug will be addressed in the next maintenance releases (TBA).

Workarounds:

1. Reboot the standby SE

OR

2. Toggle the "health monitoring from standby SE" knob under the SE-Group configuration from False -> True -> False.
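Where the toggle in workaround 2 needs to be scripted, it might look like the following Python sketch. Several things here are assumptions to verify against your controller version: that the knob corresponds to a boolean field named hm_on_standby on the serviceenginegroup object, and that the api argument behaves like an Avi Python SDK ApiSession (a get_object_by_name(path, name) method returning a dict, and a put(path, data=...) method returning a response with a status_code). The function name toggle_hm_on_standby is hypothetical.

```python
def toggle_hm_on_standby(api, se_group_name, field="hm_on_standby"):
    """Set the SE-Group knob to True, then back to False, saving each step.

    `api` is any object exposing get_object_by_name() and put() in the
    style assumed above; `field` is the assumed name of the knob.
    Returns the list of values that were written, in order.
    """
    applied = []
    for value in (True, False):
        # Re-read the object before each write so the update is based on
        # the controller's current copy of the SE-Group.
        group = api.get_object_by_name("serviceenginegroup", se_group_name)
        if group is None:
            raise LookupError("SE-Group not found: %s" % se_group_name)
        group[field] = value
        resp = api.put("serviceenginegroup/%s" % group["uuid"], data=group)
        if resp.status_code >= 400:
            raise RuntimeError("update failed with HTTP %s" % resp.status_code)
        applied.append(value)
    return applied
```

The sketch starts from the state described in this article (knob already False), so writing True and then False reproduces the False -> True -> False sequence. Try it against a non-production SE-Group first.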