After Upgrade to 4.2.1 alarm "The tier1 gateway<UUID> failover from Active to Down, service-router <UUID>" is thrown

Article ID: 381909

Products

VMware NSX

Issue/Introduction

This issue is specific to upgrades whose target version is NSX 4.2. A code block was reintroduced in 4.2 that changes a Tier-0 (T0) gateway configured as HA active/active (A/A) stateful to stateless.

The topology of a T0 in HA A/A stateful mode with a Tier-1 (T1) in HA A/A stateful mode is supported. In this configuration the T1's fate is bound to the T0's fate: if the T0 fails, the flows fail over to the backup T1 Service Router (SR). Because the T1 is A/A stateful, it expects its T0 to be stateful as well. If it finds that the T0 is not stateful, it downs itself, because to the T1 this looks as though the T0 has failed. The second edge then handles all flows and packet forwarding. This configuration requires an even number of Edges in the edge cluster, with a minimum of 2.

Important Concept Note:
What happens if an SR goes down? The state associated with each flow must be saved continuously to the SR that will take over for the failed SR. NSX automatically splits the edge cluster into sub-clusters, each comprising two edges. Edges in the same sub-cluster sync their state and act as backup for each other (VMware NSX® Reference Design Guide, Software Version 4.2, p. 138, Redundancy model).
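
The sub-cluster concept can be illustrated with a short sketch. This is not how NSX assigns sub-clusters internally; the pairing order (consecutive edges) and the names split_into_subclusters, backup_for and edge-1 through edge-4 are assumptions made purely for illustration. It only shows the two rules stated in this article: an even number of edges (minimum 2), and each edge backed by its one sub-cluster peer.

```python
from typing import List, Tuple


def split_into_subclusters(edges: List[str]) -> List[Tuple[str, str]]:
    """Pair edges into sub-clusters of two.

    The consecutive pairing order is an assumption for illustration only.
    """
    if len(edges) < 2 or len(edges) % 2 != 0:
        raise ValueError("need an even number of edges, minimum 2")
    return [(edges[i], edges[i + 1]) for i in range(0, len(edges), 2)]


def backup_for(edge: str, subclusters: List[Tuple[str, str]]) -> str:
    """Return the sub-cluster peer that holds the synced state for `edge`."""
    for first, second in subclusters:
        if edge == first:
            return second
        if edge == second:
            return first
    raise KeyError(edge)


# Hypothetical four-edge cluster.
subclusters = split_into_subclusters(["edge-1", "edge-2", "edge-3", "edge-4"])
print(backup_for("edge-3", subclusters))  # edge-4
```

The point of the model is that each edge has exactly one backup. If both members of a sub-cluster are affected at the same time, no other edge holds the state for their flows.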

Environment

NSX 4.2 

Cause

The reintroduced code block inadvertently changes the T0 from stateful to stateless. The T0 now appears faulty to its partner T1. The T1 can still find the T0, but it cannot work with the T0 to track flow states, so it downs itself. HA then performs its designed function for this situation and fails over to the backup Edge. The failover itself is by design. The problem occurs because the backup Edge goes through the same upgrade, so it is also affected by the code block changing its T0 to stateless.

Note that the timestamps of the two T1 alarms are only milliseconds apart. This reflects the HA action to fail over to the backup. The backup then also tries to fail over for the same reason: its T1 downs itself because its T0 is now stateless as well and is therefore seen as down.
With both T1 Service Routers in a down state, neither Edge is able to forward traffic.
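
The failure sequence described in this section can be modeled in a few lines. This is a simplified sketch of the reasoning only, not NSX code; the class Edge, the functions evaluate_t1, upgrade and forwarding_edges, and the edge names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Edge:
    name: str
    t0_stateful: bool            # True while the T0 is still HA A/A stateful
    t1_sr_state: str = "Active"  # state of the T1 service router on this edge


def evaluate_t1(edge: Edge) -> None:
    """An A/A stateful T1 SR downs itself when its T0 is not stateful."""
    if not edge.t0_stateful:
        edge.t1_sr_state = "Down"


def upgrade(edge: Edge) -> None:
    """Model the side effect described above: the T0 flips to stateless."""
    edge.t0_stateful = False
    evaluate_t1(edge)


def forwarding_edges(subcluster: List[Edge]) -> List[str]:
    """Edges whose T1 SR is still able to forward traffic."""
    return [e.name for e in subcluster if e.t1_sr_state == "Active"]


# Both members of the sub-cluster receive the same upgrade.
pair = [Edge("edge-1", True), Edge("edge-2", True)]
for member in pair:
    upgrade(member)

print(forwarding_edges(pair))  # []
```

If only one edge were affected, its peer would remain Active and take over the flows, which is the designed HA behavior. Because both edges receive the same change, the list of forwarding edges is empty.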

The following Edge CLI command shows the status of the T1:
get high-availability status

The output confirms that the issue lies with the T1 service router state. This state is a direct result of the T0 HA A/A now being set to stateless instead of stateful.
The issue is only seen when a T0 and T1 that are both HA A/A stateful are upgraded to NSX 4.2.1.


Resolution

A reboot of the Edges may restore traffic forwarding. This is because the T1 no longer looks for its T0 to be stateful. Forwarding will work, but not as it did originally.
HA A/A stateful is no longer configured at this point.
Engineering is scheduling the code fix for the VCF 9 release in 2025 (date TBD).
Workaround steps are being developed by engineering.


Additional Information