The topology is Diameter client pod --> WorkerNode VM --> ESXi --> Edge VM1 Tier1 (where SNAT occurs) --> Edge VM 2 Tier 0 --> External Diameter Relay Agent
Initial Diameter session establishment from all client pods to the external Diameter Relay Agent is successful.
When a link failover occurs on one of the pods, it takes ~50 seconds before the next Diameter session is established successfully for the client pods.
After a link failover in the pods, the SCTP INIT from the worker nodes is not forwarded by SNAT on Tier-1 to the external peer.
SNAT is configured with Source IP set to Any, Destination IP set to the Diameter Relay Agent (DRA) IP, and a single Translated IP. (As 3 DRAs were present, 3 different SNAT rules were configured.)
4.2.1.3.0.24533894
The connection from the new IP succeeds only after the "session/connection" entry in the Edge VM that corresponds to the old SCTP session has been deleted.
The default timeout for deleting an old inactive connection is 30 seconds for non TCP/UDP/ICMP protocols, which includes SCTP. As a result, it takes ~50 seconds to establish a new SCTP session and, in turn, a successful Diameter session.
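The behaviour described above can be illustrated with a deliberately simplified model. This is only a sketch, not how the Edge VM is implemented: the `ToyNatTable` class, its keying on (translated IP, destination IP), and all addresses are hypothetical. Only the 30-second idle timeout is taken from the text, and the model does not try to reproduce the observed ~50 seconds.

```python
IDLE_TIMEOUT_SECS = 30  # default from the text for non TCP/UDP/ICMP protocols


class ToyNatTable:
    """Hypothetical, simplified connection table for a shared translated IP."""

    def __init__(self, idle_timeout=IDLE_TIMEOUT_SECS):
        self.idle_timeout = idle_timeout
        # Assumption: one entry per (translated IP, destination IP).
        # Value is (source IP that owns the entry, time it was last seen).
        self.entries = {}

    def try_forward(self, source_ip, translated_ip, dest_ip, now):
        """Return True if a packet from source_ip would be forwarded at time now."""
        key = (translated_ip, dest_ip)
        entry = self.entries.get(key)
        if entry is not None:
            owner, last_seen = entry
            # A different source is refused while the old entry has not idled out.
            if owner != source_ip and now - last_seen < self.idle_timeout:
                return False
        self.entries[key] = (source_ip, now)
        return True


table = ToyNatTable()
# Old source establishes the session at t=0 (documentation-range addresses).
before_failover = table.try_forward("10.0.0.1", "192.0.2.10", "203.0.113.5", now=0)
# After failover, the new source retries at t=10: old entry is still present.
early_retry = table.try_forward("10.0.0.2", "192.0.2.10", "203.0.113.5", now=10)
# Retry at t=31: the old entry has been idle longer than the timeout.
late_retry = table.try_forward("10.0.0.2", "192.0.2.10", "203.0.113.5", now=31)
print(before_failover, early_retry, late_retry)
```

In this model the retry succeeds only once the old entry has been idle for the full timeout, which mirrors the delay seen after the link failover.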
The solution is to create a unique SNAT rule for every worker node, each with its own dedicated Translated IP. In other words, create 1:1 SNAT rules mapping each Source IP to its own Translated IP.
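As a rough planning aid, the 1:1 mapping could be generated with a short script like the one below. This is a sketch under assumptions: the function name, the dictionary keys, and all IP addresses are hypothetical and do not correspond to any product API or CLI. It also assumes the existing per-DRA destination match is kept, which the text does not state.

```python
from ipaddress import ip_address


def build_one_to_one_snat_rules(worker_ips, translated_ips, dra_ips):
    """Plan one SNAT rule per (worker node, DRA) pair.

    Each worker node is paired with its own translated IP, so no two
    worker nodes ever share a translated address.
    """
    if len(worker_ips) != len(translated_ips):
        raise ValueError("need exactly one translated IP per worker node")
    if len(set(translated_ips)) != len(translated_ips):
        raise ValueError("translated IPs must be unique")

    rules = []
    for worker, translated in zip(worker_ips, translated_ips):
        for dra in dra_ips:
            rules.append({
                "action": "SNAT",
                "source": str(ip_address(worker)),        # validates the address
                "destination": str(ip_address(dra)),
                "translated": str(ip_address(translated)),
            })
    return rules


# Hypothetical example: 2 worker nodes and 3 DRAs (documentation-range IPs).
workers = ["10.0.0.11", "10.0.0.12"]
translated = ["192.0.2.11", "192.0.2.12"]
dras = ["203.0.113.1", "203.0.113.2", "203.0.113.3"]

rules = build_one_to_one_snat_rules(workers, translated, dras)
for rule in rules:
    print(rule)
```

With 2 worker nodes and 3 DRAs this plans 6 rules. The resulting list is only a plan; the rules themselves still have to be created on the Tier-1 gateway by the usual means.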