Recover Corfu cluster after data corruption on one or all cluster members.
/var/log/corfu/corfu.9000.log shows error:
2024-09-28T03:46:37.880Z | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | CorfuServer: Server exiting due to unrecoverable error:
org.corfudb.runtime.exceptions.DataCorruptionException: Checksum mismatch detected while trying to read file.
Or error:
2024-09-28T03:46:37.880Z | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server
org.corfudb.runtime.exceptions.DataCorruptionException: Can't parse metadata. Segment File: /config/corfu/log/297779.log. File size: 3872130. File position: 3871666
/var/log/corfu/corfu-compactor-audit.log may also show messages such as "Tried to get layout from <corrupted-node-IP>:9000 but failed by timeout".
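The symptom check above can be sketched as a small log scan. This is a minimal illustration, not a supported tool: the function name `find_corruption` and the `SEGMENT_RE` pattern are hypothetical, and the only inputs assumed are the two message formats quoted above.

```python
import re

# Marker shared by both example errors quoted above.
CORRUPTION_MARKER = "DataCorruptionException"

# Hypothetical helper pattern: extracts the segment file path from a
# "Can't parse metadata" message, when one is present.
SEGMENT_RE = re.compile(r"Segment File: (\S+\.log)")


def find_corruption(lines):
    """Return (line_number, segment_file_or_None) for each corruption message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if CORRUPTION_MARKER not in line:
            continue  # not a corruption message
        match = SEGMENT_RE.search(line)
        hits.append((number, match.group(1) if match else None))
    return hits
```

To use it, pass an open file object for a copy of /var/log/corfu/corfu.9000.log (for example, `find_corruption(open(path, errors="replace"))`). A non-empty result suggests the node matches the symptoms described here.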
VMware NSX-T
VMware NSX
When the Corfu server starts, it loads its binary data files and verifies their checksums. If a data file is corrupted (for example, because a file was modified or a disk failed), Corfu cannot recover on its own.
The cluster of Corfu nodes protects against this scenario. If Corfu data files are corrupted on a node, then the node needs to be removed and replaced with a new node.
If more than one NSX Manager node in a three-node cluster shows signs of Corfu corruption, gather full NSX support bundles from all three NSX Managers and then open a case with Broadcom Support for further assistance. To gather a bundle, run the following command from the admin shell and copy the resulting file from its default location (/image/vmware/nsx/file-store/):
admin> get support-bundle file <nsx-manager-name>.tgz
If only one NSX Manager node shows signs of Corfu corruption, continue below.
Process to remove and replace corrupted Manager node:
admin> detach node <failed-node-uuid>
admin> join <Manager-IP> cluster-id <cluster-id> username <Manager-username> password <Manager-password> thumbprint <Manager-thumbprint>
Alternatively:
If the failed NSX Manager was auto-deployed through the NSX UI, the corrupted NSX Manager can instead be deleted in the NSX Manager UI under System > Appliances by using the Delete option for the corrupted node. A new NSX Manager node can then be deployed to return the cluster to three nodes.
Workaround:
If the Corfu data files on all three NSX Manager nodes are corrupted, the only recourse is to restore from a valid NSX backup. See the "Restore a Backup" documentation.
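The triage rules in this article can be summarized in a short sketch. It is only an illustration of the decision flow, assuming a three-node cluster. The function name `recovery_action` and the returned strings are hypothetical and are not part of any NSX tooling. Note that the article also asks for support bundles and a support case whenever more than one node is affected, including the all-nodes case.

```python
def recovery_action(corrupt_nodes, cluster_size=3):
    """Map the number of corrupted Manager nodes to the action described in this article."""
    if not 0 <= corrupt_nodes <= cluster_size:
        raise ValueError("corrupt_nodes must be between 0 and cluster_size")
    if corrupt_nodes == 0:
        return "no action needed"
    if corrupt_nodes == 1:
        # Single-node corruption: the healthy nodes still hold valid data.
        return "detach the corrupted node and join or deploy a replacement"
    if corrupt_nodes < cluster_size:
        # More than one node affected: escalate instead of replacing nodes.
        return "collect support bundles from all nodes and open a support case"
    # Every node is corrupted: no healthy copy remains in the cluster.
    return "restore from a valid NSX backup"
```

For example, `recovery_action(1)` returns the node replacement path, while `recovery_action(3)` returns the backup restore path.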