SRs might become active on an existing standby edge node after replacement

Article ID: 435626

Updated On:

Products

VMware NSX

Issue/Introduction

If both the old edge node and the new edge node are manually placed into maintenance mode before performing the 'Replace Edge Cluster Member' action, a network outage can result.

When the nodes are then taken out of maintenance mode manually, all SRs (Service Routers) may unexpectedly become active on the old edge node, because that node is still a member of the edge cluster.

The /var/log/syslog.log file on the old standby node confirms that the SRs transition to an active state immediately after the node exits maintenance mode:

<DATE_TIME> <HOSTNAME> NSX #### - [nsx@#### comp="nsx-edge" subcomp="node-mgmt" username="root" level="INFO"] Updating maintenance mode to False

<DATE_TIME> <HOSTNAME> NSX # ROUTING [nsx@#### comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-db" level="INFO"] EdgeClusterConfig Message:
<DATE_TIME> <HOSTNAME> NSX # ROUTING [nsx@#### comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-db" level="INFO"]   edge-cluster-id    : <UUID>
<DATE_TIME> <HOSTNAME> NSX # ROUTING [nsx@#### comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-db" level="INFO"]   edge-node-id    : <existing standby edge node UUID>
<DATE_TIME> <HOSTNAME> NSX # ROUTING [nsx@#### comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-db" level="INFO"]   edge-fd-id    : <existing standby edge node UUID>
<DATE_TIME> <HOSTNAME> NSX # ROUTING [nsx@#### comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-db" level="INFO"]   edge-node-id    : <current active edge node UUID>
<DATE_TIME> <HOSTNAME> NSX # ROUTING [nsx@#### comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-db" level="INFO"]   edge-fd-id    : <current active edge node UUID>

<DATE_TIME> <HOSTNAME> NSX # FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO"] <UUID> transit from state Down to Standby event Node Up
<DATE_TIME> <HOSTNAME> NSX # FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO"] <UUID> transit from state Standby to Active event Node Up
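The check described above can be roughly automated. The following is a minimal Python sketch, not an NSX tool: it assumes the log lines have the same wording as the sample entries in this article, and the function name find_sr_activations and the regular expressions are illustrative only. It reports every SR identifier that moves from Standby to Active after a "Updating maintenance mode to False" entry.

```python
import re

# Patterns are derived from the sample log lines in this article (assumption:
# real entries use the same wording; adjust if your release logs differently).
MAINT_EXIT = re.compile(r"Updating maintenance mode to False")
SR_TRANSITION = re.compile(
    r"(?P<sr>\S+) transit from state (?P<old>\w+) to (?P<new>\w+) event"
)


def find_sr_activations(lines):
    """Return SR IDs that went Standby -> Active after maintenance mode exit."""
    exited = False
    activated = []
    for line in lines:
        if MAINT_EXIT.search(line):
            # From this point on, transitions are attributed to the exit.
            exited = True
            continue
        match = SR_TRANSITION.search(line)
        if (
            exited
            and match
            and match.group("old") == "Standby"
            and match.group("new") == "Active"
        ):
            activated.append(match.group("sr"))
    return activated


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_sr.py <path to a copy of the edge syslog>
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            for sr_id in find_sr_activations(handle):
                print(sr_id)
```

Run against a copy of the log taken from the old standby node, any output means at least one SR became active there after maintenance mode was exited.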

 

Environment

VMware NSX

Resolution

Place only the old edge node into maintenance mode before the replacement, as described in the official documentation below. Do not place the new edge node into maintenance mode.

https://techdocs.broadcom.com/us/en/vmware-cis/nsx/vmware-nsx/4-2/administration-guide/operations-and-management/replacing-an-nsx-edge-transport-node-in-an-nsx-edge-cluster/replace-an-nsx-edge-transport-node-using-the-nsx-manager-ui.html