Data corruption on one or all NSX managers in a cluster.
/var/log/corfu/corfu.9000.log shows the error:
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | CorfuServer: Server exiting due to unrecoverable error: org.corfudb.runtime.exceptions.DataCorruptionException: Checksum mismatch detected while trying to read file
OR error:
<Time-stamp>| ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server org.corfudb.runtime.exceptions.DataCorruptionException: Can't parse metadata. Segment File: /config/corfu/log/297779.log. File size: 3872130. File position: 3871666
...
...
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
OR
/var/log/corfu/corfu-compactor-audit.log may show the message "Tried to get layout from <node with corruption IP>:9000 but failed by timeout"
When the Corfu server starts up, it loads binary data files and verifies checksums. If a data file is corrupted, Corfu cannot recover on its own. The cluster of Corfu nodes protects against this scenario. If Corfu data files are corrupted on a node, then the node needs to be removed and replaced with a new node.
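The log signatures quoted above can be searched for with a small script. The following is a minimal sketch, not a supported tool: the function names are hypothetical, the patterns are copied from the excerpts in this article, and it assumes the log text has already been read or copied off the NSX Manager.

```python
import re

# Signatures copied from the log excerpts quoted in this article.
CORFU_SERVER_PATTERNS = [
    re.compile(r"DataCorruptionException: Checksum mismatch detected"),
    re.compile(r"DataCorruptionException: Can't parse metadata"),
    re.compile(r"InvalidProtocolBufferException: Protocol message contained an invalid tag"),
]

# Message that may appear in corfu-compactor-audit.log; captures the node address.
COMPACTOR_PATTERN = re.compile(
    r"Tried to get layout from (\S+):9000 but failed by timeout"
)


def find_corruption_lines(log_text):
    """Return lines of corfu.9000.log text that match a known corruption signature."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in CORFU_SERVER_PATTERNS)
    ]


def find_unreachable_nodes(audit_text):
    """Return the sorted, de-duplicated node addresses named in layout timeouts."""
    return sorted(set(COMPACTOR_PATTERN.findall(audit_text)))


if __name__ == "__main__":
    # Paths are the ones named in this article; adjust if the logs were copied elsewhere.
    for path, finder in [
        ("/var/log/corfu/corfu.9000.log", find_corruption_lines),
        ("/var/log/corfu/corfu-compactor-audit.log", find_unreachable_nodes),
    ]:
        try:
            with open(path, errors="replace") as fh:
                for hit in finder(fh.read()):
                    print(f"{path}: {hit}")
        except OSError as exc:
            print(f"could not read {path}: {exc}")
```

A match only indicates that the node logged one of these messages; whether one node or several are affected still has to be checked on each NSX Manager.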
If more than one NSX Manager node in a 3-node cluster shows indications of Corfu corruption, gather full NSX support bundles from all three NSX Managers and then open a case with Broadcom Support for further assistance. To gather a bundle, run the following command from the admin shell on each NSX Manager and copy the resulting file from the default location it is written to (/image/vmware/nsx/file-store/):
admin> get support-bundle file <nsx-manager-name>.tgz
If only one NSX Manager node shows signs of Corfu corruption, continue below.
Process to remove and replace the corrupted Manager node:
admin> detach node <failed_manager_node_uuid>
admin> join <Manager-IP> cluster-id <cluster-id> username <Manager-username> password <Manager-password> thumbprint <Manager-thumbprint>
Alternatively:
If the failed NSX Manager was auto-deployed through the NSX UI, the corrupt NSX Manager can instead be deleted in the NSX Manager UI under System > Appliances, using the Delete option for the corrupt node. A new NSX Manager node can then be deployed to return the cluster to 3 nodes.
Workaround:
If the Corfu data files on all 3 NSX Manager nodes are corrupted, then the only recourse is to restore from a valid NSX backup. For the procedure, see the "Restore a Backup" documentation.
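The choice between the paths described in this article depends on how many NSX Manager nodes show corruption. The following is a minimal sketch of that decision for a 3-node cluster; the function name and the action strings are hypothetical and only summarize the text above.

```python
def recommended_actions(corrupted_nodes, cluster_size=3):
    """Map the number of corrupted NSX Manager nodes to the actions in this article."""
    if not 0 <= corrupted_nodes <= cluster_size:
        raise ValueError("corrupted_nodes must be between 0 and cluster_size")
    if corrupted_nodes == 0:
        return []
    if corrupted_nodes == 1:
        # Single-node case: remove the node and bring in a replacement.
        return ["detach the corrupted node and join or deploy a replacement"]
    # More than one node affected: collect bundles and involve support.
    actions = [
        "gather support bundles from all NSX Managers",
        "open a case with Broadcom Support",
    ]
    if corrupted_nodes == cluster_size:
        # Every node affected: the article names backup restore as the only recourse.
        actions.append("restore from a valid NSX backup")
    return actions
```

For example, `recommended_actions(1)` returns the single replace-the-node action, while `recommended_actions(3)` also includes the backup restore.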