NSX Edge HA split-brain occurs when a grouping object used by the load balancer is modified

Products

VMware NSX

Issue/Introduction

In NSX 6.4.1 or later, an NSX Edge HA pair that runs the load balancer enters a split-brain state when a grouping object used by the load balancer is modified. The issue occurs when all of the following are configured:

The NSX Edge is configured in High Availability (HA) mode.
The NSX Edge is also configured as a load balancer (LB).
The NSX Edge load balancer (LB) pool members are configured with grouping objects (security groups, IP sets, and so on).

Symptoms:

The High Availability (HA) channel between the Edge VMs goes down, as shown in the output of the show service highavailability command:

Highavailability Healthcheck Status:
This unit [0]: Up   Active: 1
Peer unit [1]: Up   Active: 0
Session via vNic_1: 10.1.1.1:10.1.1.2 Unreachable.
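
The check above can be automated on captured CLI output. The following is a minimal sketch, assuming the text of show service highavailability has already been collected into a string and that the line layout matches the excerpt above. The function names parse_ha_status and ha_channel_down are hypothetical helpers, not part of any NSX tooling.

```python
import re

# Patterns modelled on the excerpt above; other NSX builds may print the
# lines differently, so treat these as assumptions.
UNIT_RE = re.compile(r"^(This|Peer) unit \[(\d+)\]:\s*(\S+)\s+Active:\s*(\d+)")
SESSION_RE = re.compile(r"^Session via (\S+):\s*([\d.]+):([\d.]+)\s+(\w+)")

SAMPLE = """\
Highavailability Healthcheck Status:
This unit [0]: Up   Active: 1
Peer unit [1]: Up   Active: 0
Session via vNic_1: 10.1.1.1:10.1.1.2 Unreachable.
"""


def parse_ha_status(output):
    """Return the unit and session entries found in the CLI text."""
    status = {"units": [], "sessions": []}
    for raw in output.splitlines():
        line = raw.strip()
        unit = UNIT_RE.match(line)
        if unit:
            status["units"].append({
                "role": unit.group(1).lower(),
                "index": int(unit.group(2)),
                "state": unit.group(3),
                "active": unit.group(4) == "1",
            })
            continue
        session = SESSION_RE.match(line)
        if session:
            status["sessions"].append({
                "vnic": session.group(1),
                "first_ip": session.group(2),
                "second_ip": session.group(3),
                "state": session.group(4),
            })
    return status


def ha_channel_down(status):
    """True when any HA session is reported as Unreachable."""
    return any(s["state"] == "Unreachable" for s in status["sessions"])
```

Calling ha_channel_down(parse_ha_status(SAMPLE)) flags the excerpt above, because its only session is reported as Unreachable.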

The HA interface IP address on the Edge VM with index 0 is incorrect: it has switched to the address of the Edge VM with index 1. This is visible in the output of show interface vNic_1 (where vNic_1 is the management HA interface):

vNic_1    Link encap:Ethernet  HWaddr ##:##:##:##:##:BC
inet addr:10.1.1.2  Bcast:10.1.1.3  Mask:255.255.255.252
inet6 addr: fe80::250:56ff:feab:528b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:19223251 errors:0 dropped:261 overruns:0 frame:0
TX packets:17289853 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000
RX bytes:2223722150 (2120.7 Mb)  TX bytes:3620139033 (3452.4 Mb)
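
To compare the address an Edge VM reports with the address it is expected to hold, a short script can extract the inet addr field from the captured output. This is a sketch only: the expected address (10.1.1.1 for index 0 in this example) must be supplied by the operator, and interface_ipv4 and ha_ip_mismatch are hypothetical helper names.

```python
import re

# Two lines taken from the excerpt above.
SAMPLE = """\
vNic_1    Link encap:Ethernet  HWaddr ##:##:##:##:##:BC
inet addr:10.1.1.2  Bcast:10.1.1.3  Mask:255.255.255.252
"""


def interface_ipv4(output):
    """Return the first IPv4 address after 'inet addr:', or None."""
    match = re.search(r"inet addr:(\d+\.\d+\.\d+\.\d+)", output)
    return match.group(1) if match else None


def ha_ip_mismatch(output, expected_ip):
    """True when the interface reports an address other than expected_ip."""
    actual = interface_ipv4(output)
    return actual is not None and actual != expected_ip
```

With the excerpt above, ha_ip_mismatch(SAMPLE, "10.1.1.1") reports a mismatch, because the index 0 VM is showing 10.1.1.2.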

The Edge log shows the NSX Manager sending a configuration (JSON file) with an instruction to reload the configuration. To view the log, run show log follow:

2018-10-17T10:07:31+00:00 EdgeLBHA-0 config[]: [default]:  [daemon.info] INFO :: VseCommandHandler :: command json file:
..
2018-10-17T10:07:34+00:00 EdgeLBHA-0  [user.notice]
2018-10-17T10:07:34+00:00 EdgeLBHA-0 config[]: [default]:  [daemon.debug] DEBUG :: C_ServiceControl :: Checking status, op: unmonitor, service: syslog-ng, status: Not monitored
2018-10-17T10:07:34+00:00 EdgeLBHA-0 config[]: [default]:  [daemon.info] INFO :: C_ServiceControl :: Action unmonitor for syslog-ng done
2018-10-17T10:07:34+00:00 EdgeLBHA-0 config[]: [default]:  [daemon.info] INFO :: C_ServiceControl :: serverid: 0, state: 0
2018-10-17T10:07:34+00:00 EdgeLBHA-0 config[]: [default]:  [daemon.debug] DEBUG :: C_ServiceControl :: Send signal to reload, server: syslog-ng, signal: HUP, pids: 833
2018-10-17T10:07:34+00:00 EdgeLBHA-1 syslog-ng[833]: [default]:  [syslog.notice] Configuration reload request received, reloading configuration;
2018-10-17T10:07:34+00:00 EdgeLBHA-1 syslog-ng[833]: [default]:  [syslog.notice] Configuration reload finished;

In the preceding log excerpt, the hostname of the VM changes from index 0 (EdgeLBHA-0) to index 1 (EdgeLBHA-1).
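
The hostname change can also be located in a saved copy of the Edge log. The sketch below assumes each log line starts with an ISO 8601 timestamp followed by the hostname, as in the excerpt; hostname_changes is a hypothetical helper, and the sample lines are shortened copies of two lines from the excerpt.

```python
import re

# Timestamp, then hostname, as the first two whitespace-separated fields.
LOG_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\S+)\s+(\S+)")

SAMPLE_LINES = [
    "2018-10-17T10:07:34+00:00 EdgeLBHA-0 config[]: [default]:  [daemon.debug] DEBUG :: C_ServiceControl :: Send signal to reload",
    "..",
    "2018-10-17T10:07:34+00:00 EdgeLBHA-1 syslog-ng[833]: [default]:  [syslog.notice] Configuration reload request received",
]


def hostname_changes(lines):
    """Return (timestamp, old_hostname, new_hostname) for every change."""
    changes = []
    previous = None
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip elided or non-log lines such as ".."
        timestamp, host = match.groups()
        if previous is not None and host != previous:
            changes.append((timestamp, previous, host))
        previous = host
    return changes
```

Running hostname_changes(SAMPLE_LINES) returns a single entry recording the switch from EdgeLBHA-0 to EdgeLBHA-1.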

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Environment

VMware NSX for vSphere 6.4.x

Cause

When a Grouping Objects configuration (security groups, IP sets, and so on) used in load balancer pool members is edited, the NSX Manager pushes a faulty configuration to the Edge VMs. This faulty configuration causes a split-brain scenario in the Edge HA pair that runs the load balancer.

Resolution

This issue is resolved in VMware NSX Data Center for vSphere 6.4.4.

Workaround:
To work around this issue, use one of the following options:

Disable HA on the Edge.
Force sync or redeploy the impacted Edge. This fixes the issue only temporarily, until the Grouping Objects configuration is edited again.
- Force sync NSX Edge: https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.4/com.vmware.nsx.admin.doc/GUID-21FF2937-4CDF-491C-933E-8F44E21ED55E.html?hWord=N4IghgNiBcIGYHsBOBjApgAgM4E8B2KIAvkA
- Redeploy NSX Edge: https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.4/com.vmware.nsx.admin.doc/GUID-23F6CC61-7A16-4CEB-887A-9D56035A7EF4.html
Edit the HA configuration of the Edge ("Enable/disable HA log level"). This fixes the issue only temporarily, until the Grouping Objects configuration is edited again.

Additional Information

Impact/Risks:

Split-brain scenario on the impacted Edges.
Possible data plane impact on the services running on the impacted Edges.