Article ID: 318282
Issue/Introduction
Symptoms:
Network connectivity issue.
Relevant log lines:
On 192.168.x.21, Corfu retries the restore workflow for the third time and succeeds; this should never happen:
2021-11-21T05:16:41.840Z | WARN | DetectionWorker-1 | o.c.r.v.w.WorkflowRequest | WorkflowRequest: Retrying RestoreRedundancyMergeSegments 192.168.x.21:9000
A workflow is started:
2021-11-21T05:16:42.060Z | DEBUG | orchestrator-4 | o.c.i.o.Orchestrator | run: Started action RestoreRedundancyAndMergeSegments for workflow 1xxxxxx2-3xxd5-4xxb-8xxe-7xxxxxxxxxx2
2021-11-21T05:16:42.067Z | INFO | orchestrator-4 | RestoreRedundancyMergeSegments | State transfer on 192.168.x.21:9000: Layout before transfer: Layout(layoutServers=[192.168.x.22:9000, 192.168.x.23:9000, 192.168.x.21:9000], sequencers=[192.168.x.22:9000, 192.168.x.21:9000, 192.168.x.23:9000],
segments=[Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=0, end=2479682527, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000])]),
Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=2479682527, end=2479685249, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000])]),
Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=2479685249, end=-1, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000, 192.168.x.22:9000])])], unresponsiveServers=[192.168.x.21:9000], epoch=7460, clusterId=2xxxxxx6-axx7-4xx1-9xxe-9xxxxxxxxxxb)
Since the workflow was started at the previous epoch (7459), it restores only the first segment while the cluster is already at epoch 7460 (a sketch of an epoch guard follows this log excerpt):
2021-11-21T05:16:45.714Z | INFO | orchestrator-4 | RestoreRedundancyMergeSegments | State transfer on 192.168.x.21:9000: New layout: Layout(layoutServers=[192.168.x.22:9000, 192.168.x.23:9000, 192.168.x.21:9000], sequencers=[192.168.x.22:9000, 192.168.x.21:9000, 192.168.x.23:9000],
segments=[Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=0, end=2479682527, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000, 192.168.x.21:9000])]),
Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=2479682527, end=2479685249, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000, 192.168.x.21:9000])]),
Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=2479685249, end=-1, stripes=[Layout.LayoutStripe(logServers=[192.168.x.23:9000, 192.168.x.22:9000])])], unresponsiveServers=[192.168.x.21:9000], epoch=7460, clusterId=2xxxxxx6-axx7-4xx1-9xxe-9xxxxxxxxxxb).
2021-11-21T05:16:47.006Z | INFO | orchestrator-5 | RestoreRedundancyMergeSegments | State transfer on 192.168.x.21:9000: Restored.
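The root of the problem is that the workflow operates on a stale view of the cluster: it was admitted at epoch 7459 but executes while the layout is already at epoch 7460. Below is a minimal, hypothetical sketch of an epoch guard that would force such a workflow to abort and restart against the current layout; the class and method names are invented for illustration and are not part of the actual Corfu orchestrator:

import java.util.function.LongSupplier;

/**
 * Hypothetical epoch guard, for illustration only (not Corfu code).
 * A workflow records the epoch of the layout it was scheduled against
 * and re-checks it before committing any layout change.
 */
final class EpochGuard {

    private final long startEpoch;           // epoch captured when the workflow was created (7459 here)
    private final LongSupplier currentEpoch;  // supplier of the cluster's current layout epoch

    EpochGuard(long startEpoch, LongSupplier currentEpoch) {
        this.startEpoch = startEpoch;
        this.currentEpoch = currentEpoch;
    }

    /** Throws if the cluster has moved to a newer layout since the workflow started. */
    void checkStillCurrent() {
        long now = currentEpoch.getAsLong();
        if (now != startEpoch) {
            throw new IllegalStateException("Layout changed (epoch " + startEpoch
                    + " -> " + now + "); abort and restart the restore against the current layout");
        }
    }

    public static void main(String[] args) {
        // The scenario from the log: scheduled at epoch 7459, executing at epoch 7460.
        EpochGuard guard = new EpochGuard(7459L, () -> 7460L);
        try {
            guard.checkStillCurrent();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // the partial restore must not be committed
        }
    }
}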
State transfer succeeds, consensus succeeds, and the result is an illegal layout at epoch 7461:
"segments": [
{
"replicationMode": "CHAIN_REPLICATION",
"start": 0,
"end": 2479685249,
"stripes": [
{
"logServers": [
"192.168.x.23:9000",
"192.168.x.21:9000" <--- This should never be here
]
}
]
},
{
"replicationMode": "CHAIN_REPLICATION",
"start": 2479685249,
"end": -1,
"stripes": [
{
"logServers": [
"192.168.x.23:9000",
"192.168.x.22:9000"
]
}
]
}
],
"unresponsiveServers": [
"192.168.x.21:9000" <--- If it's in here
],
"epoch": 7461,
The failure detector notices that 192.168.x.21 is in the unresponsive list and initiates healing:
2021-11-21T05:17:01.299Z | INFO | DetectionWorker-1 | o.c.i.RemoteMonitoringService | Handle healing. Failure detector state: {"localNode":"192.168.x.21:9000",
"graph":{"graph":[{"endpoint":"192.168.x.21:9000","type":"CONNECTED","connectivity":{"192.168.x.22:9000":"OK","192.168.x.23:9000":"OK","192.168.x.21:9000":"OK"},"epoch":0},
{"endpoint":"192.168.x.22:9000","type":"CONNECTED","connectivity":{"192.168.x.22:9000":"OK","192.168.x.23:9000":"OK","192.168.x.21:9000":"OK"},"epoch":0},
{"endpoint":"192.168.x.23:9000","type":"CONNECTED","connectivity":{"192.168.x.22:9000":"OK","192.168.x.23:9000":"OK","192.168.x.21:9000":"OK"},"epoch":0}]},
"action":"HEAL","healed":{"endpoint":"192.168.x.21:9000","numConnections":3},"layout":["192.168.x.22:9000","192.168.x.23:9000","192.168.x.21:9000"],"unresponsiveNodes":["192.168.x.21:9000"],"epoch":7461}
2021-11-21T05:17:01.300Z | DEBUG | client-20 | c.p.w.NettyCorfuMessageEncoder | encode: New max write buffer found 93
The healing workflow is run on the illegal layout:
2021-11-21T05:17:01.452Z | INFO | orchestrator-6 | o.c.i.o.Orchestrator | run: Started workflow HEAL_NODE id dxxxxxx6-cxxe-4xx3-8xx8-5xxxxxxxxxxb
2021-11-21T05:17:01.452Z | DEBUG | orchestrator-6 | o.c.i.o.Orchestrator | run: Started action HealNodeToLayout for workflow dxxxxxx6-cxxe-4xx3-8xx8-5xxxxxxxxxxb
Environment
VMware NSX-T Data Center 3.x
VMware NSX-T Data Center
Cause
The data loss occurs due to an edge case in the Corfu cluster reconfiguration. The edge case produces an illegal cluster layout in which a Corfu node is both in the unresponsive list AND in the first data segment. This illegal layout state consequently results in the loss of the data held in that Corfu data segment.
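For clarity, the invariant that such a layout violates can be expressed as a short check: a node listed in unresponsiveServers must not appear as a log server of any segment. The sketch below is illustrative only and uses plain collections rather than the actual Corfu Layout API; applied to the epoch 7461 layout shown above, it reports the layout as illegal because 192.168.x.21:9000 appears in both places.

import java.util.List;
import java.util.Set;

/**
 * Illustrative invariant check, not the actual Corfu Layout API:
 * a node marked unresponsive must not serve any data segment.
 */
final class LayoutSanityCheck {

    static boolean isLegal(Set<String> unresponsiveServers,
                           List<List<String>> segmentLogServers) {
        for (List<String> logServers : segmentLogServers) {
            for (String server : logServers) {
                if (unresponsiveServers.contains(server)) {
                    return false; // illegal: an unresponsive node still owns a segment
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The epoch 7461 layout from the symptoms, reduced to the relevant fields.
        Set<String> unresponsive = Set.of("192.168.x.21:9000");
        List<List<String>> segments = List.of(
                List.of("192.168.x.23:9000", "192.168.x.21:9000"),  // first data segment
                List.of("192.168.x.23:9000", "192.168.x.22:9000")); // last (open) segment
        System.out.println(isLegal(unresponsive, segments)); // prints false
    }
}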
Resolution
This issue is resolved in NSX-T Data Center 3.1.x releases and newer.
Workaround:
- Shut down the node that is in the unresponsive list (in the example above, node 192.168.x.21).
- Shut down the remaining node, which is not present in every segment (in the example above, node 192.168.x.22).
- Deactivate the cluster and shrink it down to the node that has full redundancy, i.e., the node that is present in every segment (in the example above, node 192.168.x.23).
- Restore the cluster from that node (192.168.x.23).
Additional Information
Impact/Risks:
Network outage