Article ID: 318282
Issue/Introduction
Symptoms:
Network connectivity issue.
Relevant log lines:
On 192.168.x.21, Corfu retries the restore workflow for the third time and succeeds; this should never happen:
2021-11-21T05:16:41.840Z | WARN | DetectionWorker-1 | o.c.r.v.w.WorkflowRequest | WorkflowRequest: Retrying RestoreRedundancyMergeSegments 192.168.x.21:9000
A workflow is started:
2021-11-21T05:16:42.060Z | DEBUG | orchestrator-4 | o.c.i.o.Orchestrator | run: Started action RestoreRedundancyAndMergeSegments for workflow 1xxxxxx2-3xxd5-4xxb-8xxe-7xxxxxxxxxx2
2021-11-21T05:16:42.067Z | INFO | orchestrator-4 | RestoreRedundancyMergeSegments | State transfer on 192.168.x.21:9000: Layout before transfer: Layout(layoutServers=[192.168.x.22:9000, 192.168.x.23:9000, 192.168.x.21:9000], sequencers=[192.168.x.22:9000, 192.168.x.21:9000, 192.168.x.23:9000],
segments=[Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=0, end=2479682527, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000])]),
Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=2479682527, end=2479685249, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000])]),
Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=2479685249, end=-1, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000, 192.168.x.22:9000])])], unresponsiveServers=[192.168.x.21:9000], epoch=7460, clusterId=2xxxxxx6-axx7-4xx1-9xxe-9xxxxxxxxxxb)
Since the workflow was started at the previous epoch (7459), it restores only the first segment while the cluster is already at epoch 7460 (a sketch of an epoch guard follows this log excerpt):
2021-11-21T05:16:45.714Z | INFO | orchestrator-4 | RestoreRedundancyMergeSegments | State transfer on 192.168.x.21:9000: New layout: Layout(layoutServers=[192.168.x.22:9000, 192.168.x.23:9000, 192.168.x.21:9000], sequencers=[192.168.x.22:9000, 192.168.x.21:9000, 192.168.x.23:9000],
segments=[Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=0, end=2479682527, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000, 192.168.x.21:9000])]),
Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=2479682527, end=2479685249, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000, 192.168.x.21:9000])]),
Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=2479685249, end=-1, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000, 192.168.x.22:9000])])], unresponsiveServers=[192.168.x.21:9000], epoch=7460, clusterId=2xxxxxx6-axx7-4xx1-9xxe-9xxxxxxxxxxb).
2021-11-21T05:16:47.006Z | INFO | orchestrator-5 | RestoreRedundancyMergeSegments | State transfer on 192.168.x.21:9000: Restored.
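The root of the problem is that the workflow operates on a stale view of the cluster: it was admitted at epoch 7459 but executes while the layout is already at epoch 7460. Below is a minimal, hypothetical sketch of an epoch guard that would force such a workflow to abort and restart against the current layout; the class and method names are invented for illustration and are not part of the actual Corfu orchestrator:

import java.util.function.LongSupplier;

/**
 * Hypothetical epoch guard, for illustration only (not Corfu code).
 * A workflow records the epoch of the layout it was scheduled against
 * and re-checks it before committing any layout change.
 */
final class EpochGuard {

    private final long startEpoch;           // epoch captured when the workflow was created (7459 here)
    private final LongSupplier currentEpoch;  // supplier of the cluster's current layout epoch

    EpochGuard(long startEpoch, LongSupplier currentEpoch) {
        this.startEpoch = startEpoch;
        this.currentEpoch = currentEpoch;
    }

    /** Throws if the cluster has moved to a newer layout since the workflow started. */
    void checkStillCurrent() {
        long now = currentEpoch.getAsLong();
        if (now != startEpoch) {
            throw new IllegalStateException("Layout changed (epoch " + startEpoch
                    + " -> " + now + "); abort and restart the restore against the current layout");
        }
    }

    public static void main(String[] args) {
        // The scenario from the log: scheduled at epoch 7459, executing at epoch 7460.
        EpochGuard guard = new EpochGuard(7459L, () -> 7460L);
        try {
            guard.checkStillCurrent();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // the partial restore must not be committed
        }
    }
}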
State transfer succeeds, consensus succeeds, and the result is an illegal layout at epoch 7461:
"segments": [
{
"replicationMode": "CHAIN_REPLICATION",
"start": 0,
"end": 2479685249,
"stripes": [
{
"logServers": [
"192.168.x.23:9000",
"192.168.x.21:9000" <--- This should never be here
]
}
]
},
{
"replicationMode": "CHAIN_REPLICATION",
"start": 2479685249,
"end": -1,
"stripes": [
{
"logServers": [
"192.168.x.23:9000",
"192.168.x.22:9000"
]
}
]
}
],
"unresponsiveServers": [
"192.168.x.21:9000" <--- If it's in here
],
"epoch": 7461,
The failure detector notices that 192.168.x.21 is in the unresponsive list and initiates healing:
2021-11-21T05:17:01.299Z | INFO | DetectionWorker-1 | o.c.i.RemoteMonitoringService | Handle healing. Failure detector state: {"localNode":"192.168.x.21:9000",
"graph":{"graph":[{"endpoint":"192.168.x.21:9000","type":"CONNECTED","connectivity":{"192.168.x.22:9000":"OK","192.168.x.23:9000":"OK","192.168.x.21:9000":"OK"},"epoch":0},
{"endpoint":"192.168.x.22:9000","type":"CONNECTED","connectivity":{"192.168.x.22:9000":"OK","192.168.x.23:9000":"OK","192.168.x.21:9000":"OK"},"epoch":0},
{"endpoint":"192.168.x.23:9000","type":"CONNECTED","connectivity":{"192.168.x.22:9000":"OK","192.168.x.23:9000":"OK","192.168.x.21:9000":"OK"},"epoch":0}]},
"action":"HEAL","healed":{"endpoint":"192.168.x.21:9000","numConnections":3},"layout":["192.168.x.22:9000","192.168.x.23:9000","192.168.x.21:9000"],"unresponsiveNodes":["192.168.x.21:9000"],"epoch":7461}
2021-11-21T05:17:01.300Z | DEBUG | client-20 | c.p.w.NettyCorfuMessageEncoder | encode: New max write buffer found 93
The healing workflow is run on the illegal layout:
2021-11-21T05:17:01.452Z | INFO | orchestrator-6 | o.c.i.o.Orchestrator | run: Started workflow HEAL_NODE id dxxxxxx6-cxxe-4xx3-8xx8-5xxxxxxxxxxb
2021-11-21T05:17:01.452Z | DEBUG | orchestrator-6 | o.c.i.o.Orchestrator | run: Started action HealNodeToLayout for workflow dxxxxxx6-cxxe-4xx3-8xx8-5xxxxxxxxxxb
Environment
VMware NSX-T Data Center 3.x
VMware NSX-T Data Center
Cause
The data loss occurs due to an edge case in the Corfu cluster reconfiguration. The edge case produces an illegal cluster layout in which a Corfu node is both in the unresponsive list AND in the first data segment. This illegal layout state consequently results in the loss of the data held in that Corfu data segment.
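For clarity, the invariant that such a layout violates can be expressed as a short check: a node listed in unresponsiveServers must not appear as a log server of any segment. The sketch below is illustrative only and uses plain collections rather than the actual Corfu Layout API; applied to the epoch 7461 layout shown above, it reports the layout as illegal because 192.168.x.21:9000 appears in both places.

import java.util.List;
import java.util.Set;

/**
 * Illustrative invariant check, not the actual Corfu Layout API:
 * a node marked unresponsive must not serve any data segment.
 */
final class LayoutSanityCheck {

    static boolean isLegal(Set<String> unresponsiveServers,
                           List<List<String>> segmentLogServers) {
        for (List<String> logServers : segmentLogServers) {
            for (String server : logServers) {
                if (unresponsiveServers.contains(server)) {
                    return false; // illegal: an unresponsive node still owns a segment
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The epoch 7461 layout from the symptoms, reduced to the relevant fields.
        Set<String> unresponsive = Set.of("192.168.x.21:9000");
        List<List<String>> segments = List.of(
                List.of("192.168.x.23:9000", "192.168.x.21:9000"),  // first data segment
                List.of("192.168.x.23:9000", "192.168.x.22:9000")); // last (open) segment
        System.out.println(isLegal(unresponsive, segments)); // prints false
    }
}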
Resolution
This issue is resolved in NSX-T Data Center 3.1.x releases and newer.
Workaround:
- Shut down the node that is in the unresponsive list (in the example above, node 192.168.x.21).
- Shut down the remaining node, which is not present in every segment (in the example above, node 192.168.x.22).
- Deactivate the cluster and shrink it down to the node that has full redundancy, i.e., the node that is present in every segment (in the example above, node 192.168.x.23).
- Restore the cluster from that node (192.168.x.23).
Additional Information
Impact/Risks:
Network outage