Adaptive Load Balancing bond fails if one of the member links goes down and then recovers
Article ID: 111976
Products
CA Privileged Access Manager (PAM)
Issue/Introduction
The PAM server is configured with two NICs (Network Interface Cards), referred to here as NIC_A and NIC_B, bonded in Adaptive Load Balancing mode.
When communication through NIC_A breaks down, traffic to and from the PAM server continues through NIC_B. This is the expected behavior.
However, when communication through the failed NIC_A is restored, the PAM server becomes unreachable over the network. The only ways to recover from this situation are to disable and re-enable the NIC that did not fail, NIC_B, or to reboot the entire PAM server.
Environment
Physical Appliances running PAM Server 3.x or above.
Cause
The Port Channel feature of a network device bundles individual links into a channel group to create a single logical link that provides the aggregate bandwidth of up to eight physical links. If a member port within a Port Channel fails, traffic previously carried over the failed link switches to the remaining member ports within the Port Channel.
The switch had the Port Channel feature enabled, so it also tried to manage the links when a disconnection occurred.
The problem was therefore caused by both the PAM server and the switch trying to manage the bond at the same time, which resulted in a total link failure.
Resolution
Disable the Port Channel feature on the network devices for the links connected to the PAM appliances.
Additional Information
A similar situation occurs with bonds that have more than two NICs.
In this case, all the NICs in the bond except one have to fail, and the general failure to access the PAM server occurs once all of them have recovered.
For instance, in a PAM server with four NICs (NIC_1, NIC_2, NIC_3 and NIC_4) bonded in Adaptive Load Balancing mode, NIC_1, NIC_2 and NIC_3 must all fail for the problem to occur. The PAM server communication failure happens when the last of the failed NICs finally recovers, not when only one or two of them have recovered.
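The trigger condition described above can be expressed as a small model. The following Python sketch is only an illustration of the sequence of events reported in this article; it does not query PAM or the switch, and the function name bond_outage_points and the event format are hypothetical, chosen for this example. It assumes the outage occurs at the moment the last failed NIC recovers, provided that all NICs but one had been down at the same time beforehand.

```python
from typing import Iterable, List, Tuple


def bond_outage_points(nics: Iterable[str],
                       events: Iterable[Tuple[str, str]]) -> List[int]:
    """Return the indices of the events at which the outage would be expected.

    Simplified model of the behavior described in this article (hypothetical
    helper, not a PAM or bonding-driver API). Each event is a pair
    (nic_name, "up" | "down").
    """
    nic_up = {nic: True for nic in nics}
    if len(nic_up) < 2:
        raise ValueError("a bond needs at least two NICs")

    armed = False  # True once all NICs except one were down simultaneously
    outages: List[int] = []

    for index, (nic, state) in enumerate(events):
        if nic not in nic_up:
            raise KeyError(f"unknown NIC: {nic}")
        if state not in ("up", "down"):
            raise ValueError(f"unknown state: {state}")
        nic_up[nic] = state == "up"

        down = sum(1 for is_up in nic_up.values() if not is_up)
        if down == len(nic_up) - 1:
            # Only one link is still carrying traffic.
            armed = True
        elif down == 0 and armed:
            # The last failed NIC has recovered: this is where the article
            # reports that the PAM server becomes unreachable.
            outages.append(index)
            armed = False

    return outages


# Two-NIC case from the Issue section: NIC_A fails, then recovers.
two_nic = bond_outage_points(
    ["NIC_A", "NIC_B"],
    [("NIC_A", "down"), ("NIC_A", "up")],
)
print(two_nic)  # [1]

# Four-NIC case: three NICs fail, then recover one by one.
four_nic = bond_outage_points(
    ["NIC_1", "NIC_2", "NIC_3", "NIC_4"],
    [("NIC_1", "down"), ("NIC_2", "down"), ("NIC_3", "down"),
     ("NIC_1", "up"), ("NIC_2", "up"), ("NIC_3", "up")],
)
print(four_nic)  # [5]
```

In the four-NIC run, nothing is reported after the first or second recovery (events 3 and 4); the outage is flagged only at event 5, when the last failed NIC comes back. The case where every NIC is down at once is not covered by this article and is deliberately left unmodeled.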